    The Next Frontier of AI in Production Is Chaos Engineering


    There is a question that no chaos engineering tool in production today can answer: Did your last experiment test the right thing?

    Not ‘Did it stay within budget?’ That is what SLO error-budget gating handles. Not ‘Did the system survive?’ That is what abort conditions measure. The question is whether the experiment was designed to validate a specific belief about your system’s behavior, and whether its outcome changed what your team knows about failure propagation through your stack.

    If your honest answer is ‘we terminated some pods, and they recovered,’ you ran a safe experiment. Whether you learned anything useful is a separate question that current tooling does not ask.

    This article makes a concrete argument: chaos engineering has a mature safety layer and an almost nonexistent intent layer. Safety tells you how much to break. Intent tells you what breaking it will teach. These are different design problems requiring different tooling, and conflating them is why chaos programs at scale tend to accumulate scripts without accumulating insight.

    The argument is grounded in the architecture I developed and patented (US12242370B2, Intent-Based Chaos Engineering for Distributed Systems) [1], and in observations from practitioners at Intuit, GPTZero, Insurance Panda, Fruzo, and Coders.dev who have independently diagnosed the same structural gap. I will show you the architecture, walk through the data model with code, and explain why this is an AI problem, not just an orchestration problem.

    1. The Safety Layer Is Good. It Is Also Incomplete.

    Start by giving the current model its due. The SLO error-budget framework, popularized by Google’s SRE practice, gave chaos engineering its first principled safety mechanism. Tying experiment execution to the remaining error budget means you do not inject failure into a system already consuming its reliability headroom [3]. AWS Fault Injection Service’s stop conditions, Gremlin’s reliability score, and Harness ChaosGuard’s Rego policies all represent mature, production-ready implementations of this idea.

    These tools answer a well-posed question: given the current state of my system, is it safe to run an experiment right now? The answer is computable, automatable, and reasonably accurate. The question they do not answer is equally important: given the current state of my system, which experiment would be most informative to run right now?

    Safety and informativeness are orthogonal. An experiment can satisfy every safety constraint, stay within budget, trigger no aborts, cause no measurable degradation, and still produce nothing useful. If it tested a component not in the critical path of any user-facing behavior, you spent budget learning nothing. If it repeated a failure mode your system has survived a dozen times without updating your understanding of the propagation path, same result.

    Core distinction: An experiment is safe when it stays within acceptable cost. An experiment is informative when its outcome updates your model of the system’s failure behavior. These require different design criteria, and only the first has mature tooling.

    There is a second structural problem. Scripts are static at the moment of authorship. They encode assumptions about service topology, traffic patterns, and dependency behavior that may be accurate when written and silently wrong six months later. As microservice architectures change weekly, script-to-reality drift accumulates. The script still runs. It tests a world that no longer exists.

    2. How Practitioners Describe the Ceiling

    The following observations were gathered from practitioners via Qwoted, a platform connecting domain experts with researchers and journalists. A cross-industry survey of engineers who have built chaos programs in production converges on the same structural gap from different angles.

    Abhishek Pareek, Founder and Director at Coders.dev, builds distributed systems tooling. His framing is the sharpest diagnosis of the problem:

    “What we do not have is an understanding of intent-based resiliency. Existing tools are primarily script-based, and we need to create tools that can model the effects of a specific failure on a large number of microservices before executing the experiment. We need AI that understands the reasoning behind the failure in addition to the mechanics of the failure.” — Abhishek Pareek, Founder & Director, Coders.dev [6]

    The word ‘reasoning’ is doing real work here. A script captures mechanics: terminate these pods, inject this latency. It does not capture reasoning: we are running this experiment because we believe the checkout circuit breaker should trip before user-facing error rates climb above 0.1%, and we want to know if it actually does. That reasoning, the hypothesis, is what makes an experiment informative. When it lives only in the engineer’s head, it evaporates as teams and systems change.

    Edward Tian, CEO of GPTZero, runs AI inference infrastructure at scale and has developed precise language for what is missing:

    “Current chaos tools inject arbitrary points of failure but do not provide any meaningful direction for the user in terms of what they are attempting to validate. The next evolution of chaos will involve targeting specific questions about resiliency, ‘can our systems sustain a degradation in the retrieval of data?’ or ‘are we capable of tolerating a model being unavailable due to a timeout?’, rather than the use of a one-size-fits-all script.”
    — Edward Tian, Founder & CEO, GPTZero [7]

    “Can our systems sustain a degradation in the retrieval of data?” is a behavioral hypothesis. It names a target behavior, a failure condition, and an implicit success criterion. That is more information than any current chaos tool accepts as input. It is the minimum information needed to design a test that answers the question.

    3. The Intent-Based Architecture

    US Patent 12242370B2 describes a system in which chaos experiment parameters are derived from behavioral intent specifications rather than hardcoded by engineers. Here is how the architecture works.

    3.1 System Overview

    The system has four layers, each doing something the script-based model cannot. The intent specification captures the hypothesis in machine-readable form instead of leaving it in an engineer’s head. The experiment generator replaces ‘pick a script’ with ‘derive the right experiment from what you want to learn.’ The safety evaluator adds behavioral context to the blast-radius calculation. The outcome recorder turns experiment results into model updates rather than postmortem notes.

    Figure 1: Intent-Based Chaos Engineering system architecture (Image by author)

    3.2 The Intent Specification

    The specification is the input the system requires before generating any experiment. Here is a concrete example for a checkout resilience test:

    Listing 1 – Intent specification for a checkout resilience experiment

    # intent_spec.yaml
    intent:
      id: exp-checkout-inv-2025-01
      target_behavior: checkout_completion
      hypothesis: >
        The checkout flow completes within SLO when the inventory
        service experiences elevated read latency (p99 > 500ms).
        The circuit breaker on inventory_read trips before the
        user-facing error rate exceeds 0.1%.
      acceptance_criteria:
        checkout_p99_latency_ms: 400
        checkout_error_rate_pct: 0.1
        slo_budget_fraction: 0.001   # max 0.1% of daily error budget
      exclusion_zones:
        - payment_auth
        - fraud_detection
        - session_management
      min_steady_state_window: 15m   # require stable baseline before injection
      max_experiment_duration: 20m

    Notice what this encodes that a conventional chaos script does not: the hypothesis is a falsifiable statement about system behavior, not a description of what will be broken. The acceptance criteria define what ‘pass’ means in behavioral terms. The exclusion zones and steady-state window enforce constraints most teams handle manually and inconsistently.
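
    Before a specification like this can drive automation, tooling has to be able to parse and validate it. Below is a minimal loader sketch, assuming PyYAML and the field names from Listing 1; the required-field set is this article’s schema, not an established standard:

    import yaml

    # Fields the experiment generator depends on (per Listing 1; this
    # required set is an assumption of this sketch, not a standard).
    REQUIRED_FIELDS = {
        'id', 'target_behavior', 'hypothesis',
        'acceptance_criteria', 'exclusion_zones',
    }

    def load_intent_spec(path: str) -> dict:
        """Parse an intent specification and reject it early if any
        field the experiment generator needs is missing."""
        with open(path) as f:
            spec = yaml.safe_load(f)['intent']
        missing = REQUIRED_FIELDS - spec.keys()
        if missing:
            raise ValueError(f'intent spec missing fields: {sorted(missing)}')
        return spec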

    3.3 From Specification to Experiment Candidates

    The experiment generator traverses the service dependency graph to find all components on the critical path of the target behavior. Here is a simplified Python sketch of that traversal:

    Listing 2 – Simplified critical-path traversal using a weighted dependency graph

    from typing import Dict, List
    import networkx as nx

    def get_critical_path_components(
        graph: nx.DiGraph,
        target_behavior: str,
        exclusion_zones: List[str]
    ) -> List[Dict]:
        """Rank components reachable from the target behavior by the
        expected impact on that behavior if they degrade."""
        candidates = []
        for node in nx.descendants(graph, target_behavior):
            if node in exclusion_zones:
                continue
            # Transitive dependencies have no direct edge from the target
            # behavior; fall back to empty edge data (zero weights) there.
            edge_data = graph.get_edge_data(target_behavior, node) or {}
            candidates.append({
                'component': node,
                'call_frequency': edge_data.get('call_freq', 0),
                'degradation_sensitivity': edge_data.get('sensitivity', 0),
                'in_blast_radius_of': list(nx.ancestors(graph, node))
            })
        # Highest (sensitivity x frequency) first: the most informative targets.
        return sorted(
            candidates,
            key=lambda x: x['degradation_sensitivity'] * x['call_frequency'],
            reverse=True
        )

    The edge weights (call_frequency and degradation_sensitivity) are learned from past experiments and from observability telemetry (traces, service mesh metrics). A component that sits on every checkout request and whose degradation historically propagates to user-facing errors ranks highest. One that sits only on a background job ranks near zero.
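
    How those weights get updated is an implementation detail. One simple possibility, offered here as an illustrative sketch rather than the patented system’s actual rule, is an exponential moving average that blends each run’s observed propagation into the stored weights:

    def update_sensitivity_weights(graph, target_behavior, observed, alpha=0.2):
        """Blend observed propagation sensitivities from the latest
        experiment into the dependency graph's edge weights.
        `observed` maps component -> sensitivity measured in that run;
        the EMA rule and alpha=0.2 are illustrative assumptions."""
        for component, sensitivity in observed.items():
            if not graph.has_edge(target_behavior, component):
                # Discovered dependency: create the edge before weighting it.
                graph.add_edge(target_behavior, component)
            edge = graph.edges[target_behavior, component]
            old = edge.get('sensitivity', 0.0)
            edge['sensitivity'] = (1 - alpha) * old + alpha * sensitivity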

    4. Real-Time Safety Evaluation: Beyond Static Thresholds

    Ishu Anand Jaiswal, Senior Engineering Leader at Intuit, identifies the component that makes safety evaluation genuinely intelligent rather than just automated:

    “What’s missing for truly intelligent chaos is an AI planner that understands live topology and ‘resilience budget.’ It should continuously estimate how much additional latency, loss, or resource depletion the system can absorb, then select and sequence experiments that maximize learning while staying inside that budget, updating its model from every run and from real incidents.” — Ishu Anand Jaiswal, Senior Engineering Leader, Intuit [8]

    The ‘resilience budget’ concept is different from the SLO error budget. The error budget measures how much reliability you have already consumed this period. The resilience budget is prospective: given the system’s current state, how much additional stress of a specific type can it absorb before behaviors outside the experiment’s scope begin to degrade?
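
    In its simplest form, a prospective budget can be read as per-signal headroom: the gap between each acceptance criterion and its live value. The sketch below is a deliberately naive version of that idea, using two of Listing 1’s criteria and hypothetical live readings; Jaiswal’s point is that a real planner would learn these estimates rather than compute a static subtraction:

    def resilience_headroom(live: dict, criteria: dict) -> dict:
        """Per-signal headroom: how much further each signal can move
        before its acceptance criterion is breached. Assumes both dicts
        are keyed by the same signal names."""
        return {k: criteria[k] - live[k] for k in criteria}

    # Two of the Listing 1 criteria, with hypothetical live readings:
    live = {'checkout_p99_latency_ms': 312, 'checkout_error_rate_pct': 0.04}
    criteria = {'checkout_p99_latency_ms': 400, 'checkout_error_rate_pct': 0.1}
    print(resilience_headroom(live, criteria))
    # ~88 ms of latency headroom, ~0.06 pct of error-rate headroom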

    Table 1 below shows how static threshold gating compares to real-time resilience scoring across five key signals:

    Signal | Static threshold gating | Real-time resilience scoring
    SLO error budget | Checked once at experiment start | Continuously monitored; abort triggered if burn rate spikes
    Dependency health | Not checked | p99, error rate, circuit-breaker state read from service mesh before and during injection
    Blast radius | Fixed fraction of replicas (e.g. 10%) | Dynamically estimated from dependency graph + historical sensitivity weights
    Abort signal | Infrastructure metric crosses a threshold | Target behavior degradation (e.g. checkout completion rate drops > 2%)
    Topology awareness | None; script targets fixed components | Live dependency graph; experiment reroutes if target component is already degraded
    Learning | None; script unchanged after run | Predicted vs. actual blast-radius delta updates edge weights for future runs
    Table 1: Static threshold gating vs. real-time resilience scoring

    The abort signal row is where the behavioral framing produces its most concrete difference. Instead of halting when service latency crosses a threshold, an intent-aware experiment halts when the target behavior, checkout completion, degrades beyond the acceptance criterion. A latency spike on an irrelevant component does not stop the experiment. A latency spike on the checkout critical path stops it immediately, regardless of what the infrastructure dashboards show.
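
    In code, the distinction is simply which predicate the watchdog loop evaluates. A hedged sketch follows; read_completion_rate and abort_experiment are hypothetical hooks into your observability stack and chaos orchestrator, not the API of any existing tool:

    import time

    def watch_target_behavior(read_completion_rate, abort_experiment,
                              baseline_rate: float, max_drop_pct: float,
                              duration_s: float, poll_s: float = 5.0) -> None:
        """Abort on degradation of the target behavior itself (e.g.
        checkout completion rate), not on infrastructure metrics."""
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            current = read_completion_rate()
            drop_pct = 100.0 * (baseline_rate - current) / baseline_rate
            if drop_pct > max_drop_pct:
                abort_experiment(f'checkout completion down {drop_pct:.1f}%')
                return
            time.sleep(poll_s)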

    5. The User-Context Problem Infrastructure Metrics Cannot Solve

    Isabella Rossi, CPO at Fruzo, has built chaos mechanisms on top of behavioral signals rather than infrastructure metrics. Her observation cuts to a problem blast-radius control cannot address:

    “Chaos engineering tools typically treat system resilience as a static property. They inject stress based on time of day or load thresholds, which misses how brittle a system can be in one user context and perfectly stable in another. A database timeout during signup is catastrophic. The same timeout during an optional feature is barely noticeable. Current tools don’t make that distinction.” — Isabella Rossi, Chief Product Officer, Fruzo [9]

    This is technically precise, not just intuitive. A write timeout to the user registration table during a signup flow terminates a session. A write timeout to a feature-flag read cache during a preferences page falls back to defaults silently. Both events look identical on infrastructure dashboards: an elevated timeout rate on a database connection pool. Their user impact differs by orders of magnitude.

    Table 2 illustrates how the same fault, on the same component, produces wildly different blast-radius severity depending on which user behavior is active:

    Fault | Component | User context | Blast-radius severity
    DB write timeout | user_profile_db | Signup flow | CRITICAL: session terminated, user lost
    DB write timeout | user_profile_db | Preferences update | LOW: silent fallback to defaults, invisible to user
    Pod termination | inventory_service | Active checkout | HIGH: checkout may fail or stall beyond SLO
    Pod termination | inventory_service | Nightly batch sync | NEGLIGIBLE: batch retries automatically
    Latency +200ms | recommendation_api | Homepage load | LOW: async; page renders without recommendations
    Latency +200ms | recommendation_api | Checkout upsell step | MEDIUM: synchronous call; adds +200ms to checkout
    Table 2: Blast-radius severity depends on active user behavior, not just component health

    A script-based chaos tool has no way to populate the ‘User context’ column. It does not know which user behaviors are active when the experiment runs. An intent-based system can, because the intent specification names the target behavior, and the experiment generator only considers components in that behavior’s critical path under current traffic.
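
    One way tooling could populate that column is to intersect the faulted component with the critical paths of whatever behaviors are live at injection time. A minimal sketch, assuming the behavior-to-path mapping comes from the same dependency graph used in Listing 2:

    from typing import Dict, Set

    def behaviors_at_risk(component: str,
                          active_behaviors: Set[str],
                          critical_path: Dict[str, Set[str]]) -> Set[str]:
        """Which currently active user behaviors sit downstream of a
        fault on `component`? An empty set means the fault is, right
        now, technical noise rather than user-facing risk."""
        return {b for b in active_behaviors
                if component in critical_path.get(b, set())}

    # Example: a pod termination on inventory_service (cf. Table 2).
    paths = {'checkout': {'inventory_service', 'cart_service'},
             'homepage': {'recommendation_api'}}
    print(behaviors_at_risk('inventory_service', {'checkout', 'homepage'}, paths))
    # {'checkout'}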

    6. The Business-Signal Extension: Blast Radius in Dollars

    Once you anchor experiments to behaviors rather than components, the logical extension of that principle reaches further than most SRE practice goes today.

    James Shaffer, Managing Director at Insurance Panda, has rebuilt his entire chaos program around revenue signals:

    “Static scripts are garbage. They don’t respect the network’s current state. We tied our fault injection engine directly to live business metrics, not just server loads. If active quote completions drop by even two percent, the test instantly kills itself. It’s an automated kill switch based on revenue, not latency. What’s missing from genuinely intelligent chaos testing isn’t better AI to break things. It’s AI that understands the blast radius in dollar amounts. A microservice failing might look like a catastrophic outage to an SRE. But if it doesn’t stop a user from buying an auto policy, who cares? Smart chaos needs to learn the difference between technical noise and actual financial bleeding.” — James Shaffer, Managing Director, Insurance Panda [10]

    Shaffer’s kill switch, triggered by a 2% drop in quote completions, is a direct production implementation of a behavioral acceptance criterion. The abort signal is the business transaction rate, not a p99 latency threshold. Listing 3 shows what that looks like in the outcome data model:

    Listing 3 – Structured outcome record for the checkout resilience experiment

    # outcome_record.yaml
    outcome:
      experiment_id: exp-checkout-inv-2025-01
      hypothesis_result: SUPPORTED   # circuit breaker tripped as predicted
      abort_reason: null             # experiment ran to completion
      # behavioral signals (acceptance criteria)
      checkout_p99_latency_ms: 312   # passed: < 400ms
      checkout_error_rate_pct: 0.04  # passed: < 0.1%
      checkout_completion_rate_delta: -0.3%  # passed: < 2% threshold
      # blast radius: predicted vs actual
      predicted_blast_radius:
        - inventory_read_service
      actual_blast_radius:
        - inventory_read_service
        - cart_service   # DISCOVERED dependency, not in graph model
      budget_consumed_pct: 0.00083
      # model update signals
      graph_updates:
        - add_edge: [checkout, cart_service]
          sensitivity_weight: 0.34
      blast_radius_prediction_error: 0.34

    The most valuable line in this record is the discovered dependency: cart_service was not in the graph model, but the experiment revealed that it responds to inventory_read degradation. That update propagates forward: the next checkout experiment will include cart_service in its blast-radius evaluation. This is how the system’s model of itself improves over time, without human curation.
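
    Closing that loop mechanically is straightforward once outcomes are structured. Here is a sketch of applying a record’s graph_updates back to the networkx graph from Listing 2; the overwrite-on-add policy is an assumption, and a production system would blend new weights as discussed in Section 3.3:

    import networkx as nx

    def apply_graph_updates(graph: nx.DiGraph, outcome: dict) -> None:
        """Fold an outcome record's graph_updates into the live
        dependency graph so the next traversal sees discovered
        dependencies such as cart_service."""
        for update in outcome.get('graph_updates', []):
            src, dst = update['add_edge']
            graph.add_edge(src, dst, sensitivity=update['sensitivity_weight'])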

    7. Why This Is an AI Problem, Not Just an Orchestration Problem

    The reasonable objection at this point is that everything described above sounds like engineering work: dependency graph traversal, threshold comparison, structured logging. Do we really need AI for this, or just better plumbing?

    The plumbing handles deterministic decisions: if burn rate exceeds X, abort; if latency crosses Y, halt. These are the guardrails current tools implement, and they work well within known, enumerable conditions. The problems that require learned models are the ones where the decision space is not enumerable:

    • Blast-radius prediction on novel topologies. Predicting second-order effects of a fault on components not directly targeted requires generalization from behavioral patterns in past experiments. You cannot enumerate all possible service graphs at authoring time.
    • Hypothesis generation. Translating ‘test checkout resilience under inventory degradation’ into a ranked list of fault types ordered by expected informativeness is not rule execution. It requires reasoning about semantic relationships between service behaviors.
    • Sensitivity weight learning. The edge weights in the dependency graph are not static properties. They shift with traffic patterns, caching behavior, and deployment changes. They need to be learned continuously from experimental outcomes.
    • Anomaly attribution during experiments. When multiple signals move simultaneously during an experiment, determining which movement is caused by the injected fault versus pre-existing conditions requires a counterfactual model. That is a causal inference problem.

    This last point is where the field is furthest from a solution. Adaptive chaos tools are decent at correlating signals but cannot explain why a specific fault cascades the way it does through a given topology [4]. Building that capability requires something no current chaos tool attempts: a causal model of failure propagation that can be updated from experiment outcomes and interrogated with counterfactual queries.

    Figure 2: Safety-Driven Chaos vs. Intent-Driven Chaos (Image by author)

    8. The Counterargument, Taken Seriously

    Mature teams already write hypothesis statements. The Chaos Engineering principles from Basiri et al. (2016) require defining steady-state behavior before injection [2]. Netflix, Google, and Intuit run disciplined programs where engineers document what they expect to happen before running experiments. Is ‘intent-based chaos engineering’ just a description of what careful practitioners already do?

    The objection is partially correct. Mature teams do maintain hypothesis statements. The problem is that they maintain them in documentation, not in tooling. The hypothesis exists in a Notion page. The chaos tool that executes the experiment has no access to it. This creates four specific gaps:

    • The tool cannot verify that the experiment design actually tests the stated hypothesis; a mismatch between documented intent and the configured fault is never caught

    • The tool cannot adapt the experiment based on real-time system state relative to the hypothesis; it runs regardless of whether current conditions make the test meaningful

    • The tool cannot update a dependency model based on the delta between predicted and actual blast radius; that signal is lost to a postmortem document

    • The tool cannot prevent the same hypothesis from being tested redundantly; script libraries grow, insight does not

    The difference between ‘teams do this manually’ and ‘tooling makes this computable’ is the difference between a practice that scales with the team and one that does not. When the engineer who wrote the hypothesis statement leaves, so does the intent. When the system topology changes, the hypothesis may no longer correspond to any real experiment design, and nothing catches that.

    9. Three Things the Field Needs to Build

    The architecture exists. The safety primitives it depends on are mature. The observability infrastructure it requires is widely deployed. Three specific gaps remain between where the field is and where it needs to go.

    Gap 1: A standard intent specification schema

    Every team that does hypothesis-driven chaos engineering uses its own format: a Notion template, a runbook section, a JIRA ticket type. None of these are machine-readable by chaos tooling. The core fields in Listing 1 above (target_behavior, hypothesis, acceptance_criteria with its slo_budget_fraction, and exclusion_zones) capture the essential structure. Standardizing this schema, analogous to how OpenAPI standardized service interface descriptions, would let tooling ingest, validate, and act on hypotheses rather than ignoring them.

    Gap 2: Structured experiment outcome data

    Blast-radius prediction requires training data. Almost no teams currently record experiment outcomes in a structured, queryable format. Outcomes live in Slack threads and postmortem documents. The outcome schema in Listing 3 is a starting point. Instrumenting existing chaos tools to emit structured outcomes automatically, and storing them in a queryable format alongside the dependency graph, would generate the training signal that predictive models need.

    Gap 3: Hypothesis-quality evaluation

    Chaos programs are currently evaluated on coverage (how many services have been tested) and survival (did the system hold). Neither measures whether experiments were informative. A hypothesis-quality score, one that asks whether a run’s outcome changed the team’s belief about the system and by how much, would give practitioners a signal for improving experiment design rather than just accumulating scripts; one possible proxy is sketched below.

    None of these gaps requires new research. They require the field to agree on representations and invest in the data infrastructure that makes learning from experiments computable rather than anecdotal.
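
    A minimal sketch of one such proxy, built only from the predicted and actual blast-radius sets already present in the outcome record; this Jaccard-distance metric is an illustrative choice of my own, not necessarily how Listing 3’s blast_radius_prediction_error was computed:

    def blast_radius_surprise(predicted: set, actual: set) -> float:
        """Jaccard distance between predicted and actual blast radius:
        0.0 means the model anticipated propagation perfectly (little
        new information); 1.0 means the outcome was wholly unexpected."""
        union = predicted | actual
        if not union:
            return 0.0
        return 1.0 - len(predicted & actual) / len(union)

    # From Listing 3: cart_service was not predicted.
    print(blast_radius_surprise(
        {'inventory_read_service'},
        {'inventory_read_service', 'cart_service'},
    ))  # 0.5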

    Conclusion

    Chaos engineering has the right safety primitives. What it lacks is an equally principled approach to informativeness. Without an intent layer, chaos programs tend toward two failure modes: scripts that test the same things repeatedly, and experiments that stay within budget while producing nothing worth learning.

    The intent-based architecture described in this article does not replace the safety mechanisms the field has built. It adds a layer that makes those mechanisms more meaningful: grounding them in what the operator is actually trying to learn, deriving experiments from behavioral specifications rather than engineering folklore, and accumulating a model of the system’s failure dynamics that improves with each run.

    The gap is real, structural, and solvable. The question is whether the field builds the infrastructure to close it, or keeps writing scripts.

    References

    [1] M. P. Amador, K. P. Annamali, S. Jeuk, S. Patil, M. F. K. Wielpuetz, Intent-Based Chaos Level Creation to Variably Test Environments, US12242370B2 (2025), Cisco Technology Inc., United States Patent and Trademark Office

    [2] A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds, C. Rosenthal, Chaos Engineering (2016), IEEE Software, 33(3), 35–41

    [3] B. Beyer, C. Jones, J. Petoff, N. R. Murphy, Site Reliability Engineering: How Google Runs Production Systems (2016), O’Reilly Media

    [4] D. Kikuta, H. Ikeuchi, K. Tajiri, ChaosEater: Fully Automating Chaos Engineering with Large Language Models (2025), arXiv:2501.11107

    [5] L. C. Opara, O. N. Akatakpo, I. C. Ironuru, K. Anyaene, B. O. Enobakhare, Chaos Engineering 2.0: A Review of AI-Driven, Policy-Guided Resilience for Multi-Cloud Systems (2025), Journal of Computer, Software, and Program, 2(2), 10–24

    [6] A. Pareek, Expert Practitioner Response on Intent-Based Resiliency (2025), Qwoted — Coders.dev

    [7] E. Tian, Expert Practitioner Response on Hypothesis-Driven Chaos Engineering (2025), Qwoted — GPTZero

    [8] I. A. Jaiswal, Expert Practitioner Response on AI Planning and Resilience Budgets (2025), Qwoted — Intuit

    [9] I. Rossi, Expert Practitioner Response on User-Context Resilience (2025), Qwoted — Fruzo

    [10] J. Shaffer, Expert Practitioner Response on Business-Metric Chaos Engineering (2025), Qwoted — Insurance Panda
