It's 2:47 AM. PagerDuty fires. You open a Slack alert and see: p99 latency spike on checkout-service. You SSH into the host, check dashboards in four tabs, grep logs for the last 20 minutes, and eventually find a slow query introduced in a deploy six hours ago. It took 34 minutes. You resolved it, wrote the postmortem, and went back to sleep.
That 34-minute investigation is exactly what AI SRE is designed to compress.
What is AI SRE?
AI SRE (AI Site Reliability Engineering) refers to systems that apply AI — primarily large language models and ML-based anomaly detection — to automate or assist the core workflows of SRE: incident detection, root cause analysis, on-call triage, and runbook execution. An AI SRE system ingests your telemetry (metrics, logs, traces), understands the structure of your services, and acts as an always-on engineer that triages alerts, correlates signals, and proposes or executes remediation steps.
Unlike traditional alerting, which fires when a threshold is crossed, AI SRE interprets context: is this spike correlated with a recent deploy? Does it follow the same pattern as last week's incident? Has anything changed in upstream dependencies? That contextual reasoning — done automatically and at 3 AM — is the core value proposition.
AI SRE doesn't replace SREs. It handles the repetitive investigation work so your engineers focus on system design, capacity planning, and the incidents that actually require human judgment.
How AI SRE Differs from Traditional SRE
Traditional SRE practice — described in Google's SRE book and evolved across the industry — relies on human engineers supported by tooling. AI SRE shifts where intelligence lives in that stack.
| Dimension | Traditional SRE | AI SRE |
|---|---|---|
| Incident detection | Threshold/anomaly alerts fire; engineer investigates | AI correlates signals, suppresses noise, surfaces the root event |
| Root cause analysis | Engineer manually traces logs, metrics, deployments | AI runs correlation across all telemetry automatically |
| On-call triage | Engineer pages through dashboards at 3 AM | AI produces a structured incident summary before engineer wakes up |
| Runbook execution | Engineer follows runbook steps manually | AI executes known runbook steps (restart, scale, rollback) |
| Postmortem | Engineer writes it from scratch after resolution | AI drafts postmortem from incident timeline |
| Toil reduction | Automation requires upfront engineering | AI identifies toil patterns and suggests automation targets |
The difference isn't about removing humans from the loop. It's about changing when humans enter it — from "wake up and start investigating" to "wake up, read the AI summary, make a decision."
For a deeper look at how SRE practice compares to DevOps and platform engineering, see SRE vs DevOps and DevOps vs SRE vs Platform Engineering.
Core Capabilities of an AI SRE System
Automated Root Cause Analysis
RCA is where AI SRE provides the clearest value. A production incident typically involves correlated signals across metrics, logs, and traces — a latency spike in service A that traces back to a slow query in service B, caused by lock contention introduced in a migration deployed at 6 PM.
Connecting those dots manually takes 20–40 minutes. An AI RCA system does it by:
- Identifying the blast radius — which services are affected, based on service topology
- Correlating the timeline — what changed (deploys, config changes, traffic patterns) in the window before the incident
- Tracing the causal chain — which service is the origin vs. which are downstream effects
- Proposing a hypothesis — "p99 latency increased 3x in checkout-service at 02:41. A new version of payment-service was deployed at 21:14. Call latency to payment-service increased from 120ms to 890ms after the deploy. Probable cause: a slow path in v2.3.1 of payment-service."
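To make the timeline-correlation step concrete, here is a minimal sketch in Python. It assumes deploy and config changes are available as timestamped records and simply ranks changes in a lookback window by recency; the data model and scoring are illustrative, not any particular vendor's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ChangeEvent:
    service: str          # service that changed
    kind: str             # "deploy", "config", "migration", ...
    timestamp: datetime   # when the change landed

def rank_suspect_changes(anomaly_start: datetime,
                         changes: list[ChangeEvent],
                         lookback: timedelta = timedelta(hours=8)) -> list[ChangeEvent]:
    """Return change events inside the lookback window, most recent first.

    Real RCA systems also weight candidates by service topology (is the
    changed service upstream of the affected one?) and by similarity to
    past incidents; this sketch orders by recency only.
    """
    window_start = anomaly_start - lookback
    candidates = [c for c in changes if window_start <= c.timestamp <= anomaly_start]
    return sorted(candidates, key=lambda c: anomaly_start - c.timestamp)
```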
The quality of AI RCA is directly proportional to the quality of your telemetry. If your services don't emit traces, the causal chain analysis fails. If your metrics lack deploy annotations, the correlation with code changes is guesswork. Garbage telemetry in, garbage RCA out.
AI-Assisted Incident Triage
Triage is the decision: how severe is this, who owns it, what's the immediate mitigation? AI triage handles the first pass:
- Deduplication: Multiple alerts firing for the same root cause get grouped into one incident
- Severity scoring: Based on error budget burn rate, customer impact signals, and blast radius
- Ownership routing: Matching the affected service to the on-call team based on service ownership data
- Initial context: A structured summary — what's broken, since when, what's been tried — assembled before the engineer opens Slack
This matters most at scale. A 50-service system can generate hundreds of alerts during a cascading failure. AI triage turns that noise into a single actionable incident with a ranked list of suspects.
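As a rough illustration of the first two steps, deduplication and severity scoring, the sketch below groups raw alerts by suspected root service and ranks the resulting incidents by SLO burn rate. The alert shape and scoring are assumptions for the example, not a specific product's data model.

```python
from collections import defaultdict

def group_and_score(alerts, burn_rates):
    """Group raw alerts into incidents and rank them by burn rate.

    alerts: list of dicts like {"service": "checkout", "signal": "p99_latency"}
    burn_rates: dict mapping service -> current error-budget burn rate multiple

    Illustrative only: real triage also uses service topology and time
    proximity, not just a service label on each alert.
    """
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert["service"]].append(alert)

    ranked = sorted(
        incidents.items(),
        key=lambda item: burn_rates.get(item[0], 0.0),
        reverse=True,
    )
    return [
        {"service": svc, "alerts": grouped, "burn_rate": burn_rates.get(svc, 0.0)}
        for svc, grouped in ranked
    ]
```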
Predictive Alerting
Most alerting fires after something breaks. Predictive alerting uses ML to identify leading indicators — patterns that precede failures based on historical incident data.
Examples:
- Memory growth rate trending toward OOM in 4 hours
- Queue depth increasing in a pattern that preceded a consumer crash last month
- Connection pool saturation approaching 90% before peak traffic window
Predictive alerts are harder to tune than threshold alerts — false positives are expensive in on-call fatigue terms — and they require significant incident history to build reliable models. Most mature implementations use them for specific high-value cases (OOM prediction, disk exhaustion) rather than general-purpose failure prediction.
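As an example of the kind of leading-indicator check involved, here is a minimal sketch of the memory-growth case: a least-squares slope over recent usage samples, extrapolated to the memory limit. The sample format is an assumption for the example; real predictors use richer, seasonality-aware models trained on incident history.

```python
def hours_until_oom(samples, memory_limit_mb):
    """Estimate hours until memory exhaustion from recent usage samples.

    samples: list of (hours_relative_to_now, used_mb) points, e.g.
             [(-3.0, 410.0), (-2.0, 455.0), (-1.0, 498.0), (0.0, 540.0)]
    Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope_mb_per_hour = cov / var if var else 0.0
    if slope_mb_per_hour <= 0:
        return None
    current_mb = samples[-1][1]
    return (memory_limit_mb - current_mb) / slope_mb_per_hour

# With the sample data above and a 700 MB limit, this projects roughly
# 3-4 hours of headroom before OOM.
```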
Runbook Generation and Execution
Runbooks are the documented response procedures for known failure modes. AI SRE extends this in two ways:
Generation: Given an incident type and its history, AI can draft a runbook — essentially pattern-matching against past incidents and codifying the resolution steps. This is useful for capturing institutional knowledge that lives only in engineers' heads.
Execution: For safe, well-defined steps (restart a pod, scale a deployment, clear a cache), AI can execute runbook actions automatically. This is where most teams draw a cautious line — automated execution is only safe when the blast radius of a wrong action is bounded and reversible.
The honest reality: runbook execution is the most dangerous part of AI SRE. An AI that confidently executes the wrong runbook step based on a misdiagnosed incident can make things worse. Most production deployments in 2026 use human-in-the-loop approval for any execution step, and full autonomous execution only in isolated, well-tested environments.
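A minimal sketch of that human-in-the-loop pattern, assuming a small allowlist of reversible actions and an external approval prompt (the action names, confidence handling, and approval hook are illustrative, not any tool's API):

```python
# Illustrative approval gate for AI-proposed runbook actions. The action
# names and the approval mechanism are assumptions for this sketch.
REVERSIBLE_ACTIONS = {"restart_pod", "scale_deployment", "clear_cache"}

def execute_with_gate(action, request_human_approval, run):
    """Run bounded, reversible actions only after explicit human approval;
    refuse everything else outright."""
    if action["name"] not in REVERSIBLE_ACTIONS:
        return {"status": "refused", "reason": "action not in reversible allowlist"}
    if not request_human_approval(action):   # e.g. a Slack approve/deny prompt
        return {"status": "rejected_by_human"}
    return {"status": "executed", "result": run(action)}
```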
AI SRE Tools in 2026
The AI SRE tooling landscape is fragmented. Some tools are full-stack incident management platforms with AI layered in. Others are purpose-built AI agents that integrate into your existing stack.
| Tool | Approach | Strengths | Considerations |
|---|---|---|---|
| incident.io | Incident management + AI features | Strong Slack-native workflow, good for coordination | RCA is advisory, not deep telemetry analysis |
| Harness | Platform engineering + AI reliability | Good CI/CD integration, deploy correlation | Heavy platform footprint, complex to adopt standalone |
| resolve.ai | Purpose-built AI SRE agent | Focused on RCA automation, integrates with PagerDuty | Newer, smaller ecosystem |
| Rootly | Incident management + AI | Clean UX, runbook automation | AI features are add-ons, not core differentiation |
| Last9 | Full-stack AI SRE platform (Gartner Cool Vendor) | Owns the telemetry layer (OTel-native, unified metrics/traces/logs) + AI SRE on top + agent SDK to build custom ops agents | Newer to autonomous execution; strongest where telemetry depth matters |
The key question when evaluating AI SRE tools: where does your observability data live? Tools that can't see your full telemetry — traces, metrics, logs together — can only do surface-level correlation. Tight integration between the AI layer and the observability store is what separates useful RCA from educated guesses.
One distinction worth calling out: most AI SRE vendors query your telemetry from a third-party store. Last9 owns the telemetry data platform itself — meaning the AI layer has direct access to raw, unsampled signals rather than a summary API. This is the architecture Gartner recognized when naming Last9 a Cool Vendor in AI for SRE and Observability in 2025.
How to Get Started with AI SRE using Last9
Last9 is an AI SRE platform built on a foundation it owns end-to-end: a unified telemetry data platform (metrics, traces, logs via OpenTelemetry) with AI SRE capabilities on top, plus an agent SDK that lets engineering teams build their own ops agents. It was named a Gartner Cool Vendor in AI for SRE and Observability in 2025 — one of the few vendors in that report that owns the telemetry layer rather than querying it from a third-party store.
This architecture matters: AI SRE on top of raw, unsampled telemetry produces better RCA than AI on top of a summarized API. The difference shows up at the hard incidents — cascading failures, high-cardinality service meshes, stateful system failures.
Step 1: Instrument with OpenTelemetry
OTel is the standard. Instrument your services to emit traces, metrics, and logs through OTel SDKs. For AI/LLM workloads, instrument at the LLM call level — capture prompt, model, tokens, latency. For traditional services, focus on HTTP spans, database queries, and message queue operations.
The instrumentation doesn't have to be complete on day one. Start with your critical path: the services involved in your most frequent or most costly incidents.
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export spans to Last9's OTLP endpoint over gRPC.
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="<your-last9-otlp-endpoint>")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Wrap a critical-path operation in a span and attach the business
# attributes the AI layer can correlate on later.
with tracer.start_as_current_span("checkout.process_payment") as span:
    span.set_attribute("payment.provider", "stripe")
    span.set_attribute("payment.amount_cents", amount)
    result = process_payment(amount)
    span.set_attribute("payment.status", result.status)
```

See doing SRE the right way for how this fits into broader reliability practice.
Step 2: Define SLOs and Error Budgets
AI SRE tools need a signal to anchor on. Error budget burn rate is the cleanest signal: it's a single number that tells you how fast you're spending your allowed downtime. Configure SLOs in Last9 for your critical services — start with availability and latency.
An SLO like "99.9% of checkout requests under 500ms" gives the AI triage layer a severity anchor: a 10% burn rate over 1 hour is different from a 100% burn rate over 1 hour.
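As a worked example of how that anchor is computed, here is the standard burn-rate calculation applied to the SLO above; the request counts are made up for illustration.

```python
def burn_rate(failed_requests, total_requests, slo_target=0.999):
    """Error-budget burn rate: observed error ratio divided by the error
    ratio the SLO allows. 1.0 means the budget is spent exactly at the end
    of the SLO window; sustained multiples above that are page-worthy.
    """
    allowed_error_ratio = 1.0 - slo_target          # 0.001 for a 99.9% SLO
    observed_error_ratio = failed_requests / total_requests
    return observed_error_ratio / allowed_error_ratio

# 1,200 failed out of 100,000 checkout requests in the last hour:
# 1.2% observed errors vs 0.1% allowed -> burn rate of 12x.
print(burn_rate(1_200, 100_000))   # 12.0
```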
Step 3: Use Last9's AI for RCA and Incident Triage
Once your telemetry is flowing into Last9, the AI layer works across three modes:
Natural language queries — ask questions directly against your telemetry:
- "What changed in payment-service in the 2 hours before p99 latency spiked?"
- "Which services are failing the most in the last 30 minutes?"
- "Show me all errors correlated with the deploy at 21:14"
Automated incident triage — Last9 correlates signals across services, groups related alerts into a single incident, scores severity against your SLO burn rate, and surfaces a structured summary before you open a dashboard.
RCA memory — Last9 builds a per-service fault library over time. When a new incident fires, it surfaces the most similar past incident with a confidence score and the exact commands or PR that resolved it.
Step 4: Build Custom Ops Agents with the Agent SDK
Most SRE teams eventually want AI behavior tuned to their specific services, runbooks, and escalation logic — not a generic SaaS workflow. Last9's agent SDK lets you build programmable ops agents that embed telemetry intelligence directly into your delivery pipeline:
- Auto-remediation agents: detect cascading failures early, trigger predefined runbooks for specific systems (Postgres, Redis, MySQL)
- Custom RCA agents: define which signals matter for your service topology, how confidence is scored, what gets paged vs. suppressed
- Pipeline integration: embed agents into CI/CD to catch reliability regressions before they hit production
This is the layer that separates Last9 from pure incident management tools. You're not locked into one vendor's notion of what "AI SRE" means — you compose it from your own runbooks, your own telemetry, your own escalation policies.
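The general shape of such an agent looks something like the sketch below. The class and method names are hypothetical placeholders used to show the structure, not Last9's agent SDK API; refer to the SDK documentation for the actual interfaces.

```python
# Conceptual shape of a custom auto-remediation agent. All names here are
# hypothetical placeholders, not a real SDK's API.
class PostgresRemediationAgent:
    def __init__(self, telemetry, runbooks, pager):
        self.telemetry = telemetry    # client for querying metrics/traces/logs
        self.runbooks = runbooks      # registry of predefined, reviewed runbooks
        self.pager = pager            # escalation hook when confidence is low

    def handle(self, incident):
        # 1. Pull the signals this team decided matter for Postgres.
        signals = self.telemetry.query(
            service="postgres", window_minutes=30,
            metrics=["connections_active", "lock_waits", "replication_lag_s"],
        )
        # 2. Match against known failure patterns; each match carries a confidence.
        match = self.runbooks.best_match(incident, signals)
        # 3. Act only above a team-chosen confidence threshold; page otherwise.
        if match and match.confidence >= 0.9 and match.runbook.is_reversible:
            return match.runbook.execute(dry_run=False)
        return self.pager.escalate(incident, context=signals)
```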
Last9's MCP integration also lets AI agents in your IDE (Cursor, Claude Code, VS Code) query live production telemetry directly — useful when debugging in context without switching to a dashboard.
For AI-specific observability (monitoring LLM pipelines, token costs, agent traces), see LLM observability. To get started with Last9, sign up here.
Limitations of AI SRE Today
This section exists because most vendor content won't write it. Here's what AI SRE doesn't do well in 2026:
Non-determinism in LLM-based RCA. The same incident can produce different diagnoses across runs. This is fine for advisory output ("here are the likely causes") and dangerous for autonomous execution. Don't deploy AI runbook execution without human approval gates until you've validated the model's accuracy on your specific incident patterns.
Hallucinated runbooks. AI will confidently produce a runbook step that references a service, endpoint, or config key that doesn't exist in your environment. The runbook looks correct because it matches general patterns. Validate every AI-generated runbook against your actual system before treating it as executable.
Alert context gaps. AI correlation quality degrades when signals lack context. An alert that says "high CPU on host-42" with no service attribution, no deploy metadata, and no traces gives the AI almost nothing to work with. The AI produces low-confidence output and hedges — which is useless at 3 AM.
Coverage blind spots. Most AI SRE tools work well for web service reliability patterns. They work poorly for stateful systems (databases, queues, distributed consensus), GPU workloads, and infrastructure-layer failures. Know your failure modes and check whether the tool has seen similar patterns before betting on it.
Cost at scale. LLM inference is not cheap. At high alert volume, AI-analyzed incidents can add meaningful cost. Some tools solve this with tiered analysis (cheap model for triage, expensive model for deep RCA). Understand the pricing model before deploying broadly.
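A tiered setup can be as simple as a routing rule that reserves the expensive model for high-severity or novel incidents; the thresholds and model names below are illustrative assumptions.

```python
# Sketch of tiered analysis: cheap model for routine triage, expensive
# model for fast-burning or unfamiliar incidents.
def pick_model(incident):
    if incident["burn_rate"] >= 10 or not incident["matches_known_pattern"]:
        return "large-reasoning-model"   # deep RCA, higher per-call cost
    return "small-fast-model"            # quick triage summary
```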
AI SRE is genuinely useful. It's also genuinely immature. The teams getting the most value from it are treating it as an augmentation layer — AI does the first 80% of investigation, engineer validates and decides — rather than an autonomous replacement.
Last9: AI SRE Built on Telemetry You Own
Most AI SRE tools are a layer on top of someone else's data. They query your Datadog or Prometheus, run an LLM over the results, and call it AI SRE. The problem: they're working from summaries, not the raw signal. When an incident is novel or multi-system, the summarized view loses exactly the detail that matters.
Last9 is built differently. It's an AI SRE platform that owns the telemetry data platform underneath — unified metrics, traces, and logs via OpenTelemetry, stored cost-efficiently with no sampling loss. The AI layer sits directly on top of that raw data, which is why RCA quality holds up at the hard incidents where other tools hedge.
On top of the telemetry layer, Last9 ships:
- AI-driven incident triage — correlates signals, groups alerts, scores against SLO burn rate, surfaces the root event
- RCA memory — per-service fault library that matches new incidents to past resolutions with confidence scores
- Agent SDK — build programmable ops agents tuned to your runbooks, escalation logic, and service topology; not locked into a vendor's workflow
It was named a Gartner Cool Vendor in AI for SRE and Observability in 2025 — one of the few vendors in that report that combines owned telemetry with AI SRE capabilities and an open agent SDK.
FAQ
What is AI SRE? AI SRE (AI Site Reliability Engineering) is the application of AI systems — LLMs, ML models, and automation — to core SRE workflows: incident detection, root cause analysis, on-call triage, and runbook execution. The goal is to reduce mean time to resolution by automating the investigative work that currently requires a human engineer to run through dashboards and logs manually.
How does AI SRE work? AI SRE systems ingest telemetry data (metrics, logs, traces) from your infrastructure and application layer. When an incident fires, the AI correlates signals across services, identifies the probable root cause, groups related alerts into a single incident, and produces a structured summary. More advanced systems can execute remediation steps (restart, scale, rollback) with or without human approval.
How is AI SRE different from traditional SRE? Traditional SRE relies on human engineers supported by alerting, dashboards, and runbooks. AI SRE shifts the investigative work from human to machine, so engineers enter the incident later in the process — after context has already been assembled. The engineering judgment (is this hypothesis correct? should I execute this fix?) stays with humans.
What are the best AI SRE tools in 2026? The leading tools are incident.io (incident management with AI), Harness (platform engineering + reliability), resolve.ai (purpose-built AI SRE agent), Rootly (incident management + runbook automation), and Last9 (OpenTelemetry-native observability with an AI assistant for RCA). The right choice depends on whether you need a full incident management platform or AI capabilities layered over an existing observability stack.
Is AI SRE production-ready? For triage, noise reduction, and advisory RCA: yes, and it's proven in production at scale. For autonomous runbook execution: conditionally — safe for bounded, reversible operations (restart, scale) with human approval gates; not recommended for complex multi-step remediations without significant validation on your specific environment.
