
AI agents are quietly generating chaos engineering failures that most enterprises aren’t tracking — because the incidents don’t fit any existing postmortem template. The agent initiates an action. The action is technically correct given its context. The context is incomplete. The infrastructure cascades. And by the time the incident review happens, three teams are arguing about whether it was an agent failure or an infrastructure failure, because the frameworks for thinking about these two things have never been connected.
The scale of this exposure is no longer theoretical. Seventy-nine percent of organizations now have some form of AI agent in production, with 96% planning expansion. Gartner predicts 33% of enterprise software will include agentic AI by 2028, but separately warns that 40% of those projects will be canceled due to poor risk controls.
What neither statistic captures is the failure mode happening between those two numbers: Agents that are running, that are not canceled, and that are quietly generating infrastructure events no one has categorized as risk.
The gap between chaos engineering and agent autonomy
Sayali Patil, who spent six years building infrastructure automation systems at Cisco and Splunk, and filed a patent on intent-based chaos engineering methodology, has watched organizations make the same structural mistake: Treating autonomous agents and chaos engineering as separate disciplines.
“They are not,” Patil writes in a detailed analysis. “They are the same discipline, and the gap between them is quietly generating the next wave of major production incidents.”
Most mature engineering organizations have invested in programs for that discipline. Game days, blast radius controls, SLO-gated experiments. When a human engineer initiates a chaos experiment, a person makes a judgment call about whether the system has capacity to absorb the perturbation right now. They check dashboards. They look at the error budget burn rate. They assess whether dependencies are stable.
When you introduce an autonomous remediation agent — one that can restart services, reroute traffic, scale resources, or modify configurations in response to detected anomalies — that question disappears. The agent sees an anomaly. It takes an action. The action is a chaos event. No SLO burn rate check. No blast radius calculation. No human judgment about whether right now is the right moment to introduce additional stress into a system that may already be under pressure from three other directions.
Related: Enterprise AI Faces Growing Debt Risks
A failure mode nobody tested for
Here is the specific failure mode she has watched play out. A remediation agent detects elevated latency on a microservice and responds by restarting the service cluster — a reasonable action given its training data and its narrow view of the incident.
What the agent doesn’t know: Three other services are in the middle of handling peak traffic. The shared connection pool is already at 87% utilization. A dependent database is running a background index rebuild. The restart triggers a thundering herd against the recovering service.
What started as a latency spike the agent was designed to fix becomes a cascade it was never designed to model. The blast radius of that agent action was not the service restart. It was everything downstream of the restart, in a system state the agent had no complete picture of.
Nobody’s chaos engineering program had tested for that specific combination. Nobody’s blast radius calculation had included the agent as an actor. Because we don’t think of agents as chaos injectors. We should.
According to the AI Incidents Database, reported AI-related incidents rose 21% from 2024 to 2025. That count almost certainly understates the actual exposure, because most organizations have no incident classification that captures an autonomous agent action as the initiating cause of a cascade. The incident gets logged as a service restart, a connection pool saturation, or a latency event. The agent is invisible in the postmortem.
The resilience budget model
The underlying problem is that enterprise systems have no shared language for absorb capacity — the real-time estimate of how much additional stress a system can take before it breaches its SLO commitments. These programs manage it implicitly, through human judgment and static thresholds that fire after a limit has already been crossed. Agents don’t manage it at all.
Through structured primary research with site reliability engineering and platform engineering practitioners across organizations including Intuit and GPTZero, she has been developing a resilience budget model. The core idea is to treat absorb capacity as a continuously recomputed, consumable resource rather than a static threshold you try not to breach.
A resilience budget draws on four live signal classes:
Related: Dun and Bradstreet Rebuilds Business Database
- SLO burn rate is the primary input, because it directly encodes the distance between current system behavior and the commitment that actually matters.
- P99 latency trend matters more than absolute latency, because a service trending upward over forty minutes tells you something different than a service that has been stable at the same absolute value.
- Dependency saturation state is the most commonly missed signal; a chaos experiment or an agent action that assumes a shared connection pool is freely available when it’s sitting at 87% will produce failure modes that nobody designed for.
- Application behavioral signals — session completion rates, API call pattern shifts, conversion degradation — surface system stress earlier than infrastructure metrics do.
What makes this a budget rather than a threshold is that it is consumable. Every chaos experiment draws from the available capacity. Every agent action draws from it. In multi-team organizations where multiple experiments and multiple agents may be acting simultaneously, the budget is shared.
Without a shared ledger of consumption, two teams running experiments against overlapping dependencies produce a combined blast radius that neither team planned. Add autonomous agents acting completely outside the ledger, and the accounting collapses.
Where AI helps and where it doesn’t
Several engineering organizations are now running experiments using large language models to generate chaos hypotheses from dependency graphs and incident postmortem corpora. The results are directionally useful. Language models surface plausible failure modes that experienced SREs recognize as worth testing, and they generate hypotheses faster than manual processes.
The limit is dependency graph staleness, and it is a hard limit. A hypothesis generated from a graph that doesn’t reflect last month’s service extraction, or a new shared library dependency added two sprints ago, will propose an experiment with incorrect blast radius assumptions. The problem is not that the model makes a mistake — it’s that the model doesn’t know it’s making one.
Stanford’s Trustworthy AI Research Lab found that model-level guardrails alone are insufficient: Fine-tuning attacks bypassed leading models in the majority of tested cases. The implication for chaos hypothesis generation is direct — a model that cannot reliably hold its own safety boundaries cannot be trusted to accurately model the blast radius of an action it has never seen in a dependency graph it has not verified.
When hypothesis generation draws instead from postmortem corpora, the staleness problem shrinks considerably. Postmortems describe failures that actually occurred in the system at a specific moment in time. The signal is inherently validated by production reality.
What AI cannot do, and should not be asked to do, is make the execution decision when signals are ambiguous. That judgment requires awareness of things that live entirely outside any monitoring system: Pending deployments that changed the dependency landscape an hour ago, on-call staffing levels on a holiday weekend, a customer commitment that makes any additional risk unacceptable until Monday.
A model without access to that context should not be making that call. This is not a temporary limitation pending a more capable model. It is a structural constraint of what machine observability can represent.
Related: Google Expands Search Box Capabilities
Governing agents like chaos experiments
The governance implication is straightforward to describe and harder to implement than it sounds. Every autonomous agent action that touches infrastructure needs to register against the same live signal layer that governs chaos experiments. The same SLO burn rates, latency trends, dependency saturation states that a human engineer would check before initiating an experiment should gate what an agent is permitted to do and when. If the resilience budget is below a defined floor, the agent waits or escalates. It does not act.
Agent actions also need to be modeled as experiments, not just logged as events. When an agent restarts a service, the question isn’t only whether the restart completed successfully. It’s whether the blast radius of that action was proportionate to the available absorb capacity, and what cascading effects it produced across dependencies.
And when signals are genuinely ambiguous — when the budget score is unclear, when a recent deployment has changed the topology in ways the agent’s context window doesn’t capture, when dependency states are in flux — the execution decision needs to go to a human. Not as a permanent limitation on agent autonomy, but as a hard engineering requirement for the current state of the technology.
A circuit breaker that hands ambiguous cases to a human is not a weakness in the agent architecture. It is the thing that makes the architecture trustworthy enough to actually run in production.
The organizations that operate autonomous agents reliably at scale are not the ones with the most sophisticated models. They are the ones that understood, before something went badly wrong, that every agent action is a chaos event and built their governance layer accordingly.
The practical first step is unglamorous: Audit every autonomous agent currently touching infrastructure, map its action surface against your live SLO burn rate signals, and define explicit floor conditions below which the agent is required to wait or escalate. That audit will surface agents acting entirely outside your resilience accounting.
Most organizations running agents at scale today have several. Find them before production does.
Leave a Reply