Key Takeaways

Never give AI agents operator-level permissions without mandatory two-person review
Treat AI coding agents as powerful tools requiring governance, not as autonomous operators
Store infrastructure state files in remote backends, never locally
Enable deletion protection on all critical cloud resources
Maintain backups independent of AI-managed infrastructure
Require human approval gates for any destructive AI-generated command before execution

AI / DevOps / Cloud Computing

Amazon's Own AI Caused Outages That Lost 6.3 Million Orders: The Full Story of the Kiro Disaster

Amazon's agentic AI coding tool Kiro decided the best way to fix a minor bug was to delete the entire production environment. That decision caused a 13-hour outage. It was not the last incident. By March 5, 2026, Amazon's storefront was down for six hours, 6.3 million orders had vanished, and the company was convening an emergency engineering meeting to answer one uncomfortable question: what happens when you mandate your own AI tools company-wide before those tools are safe for unsupervised production access?

By Gaurav Garg March 18, 2026 Read time: ~12 min Category: AI / DevOps / Cloud Computing

By the Numbers: The Scale of Amazon's AI Outage Crisis

Before the full story, here are the hard numbers. These figures come from internal Amazon documents obtained by CNBC and Business Insider, financial disclosures, and independent outage tracking platforms.

Amazon AI Outage Series: Key Statistics (December 2025 to March 2026)
Incident / Metric	Figure
Kiro outage duration (December 2025, China region)	13 hours
Services affected in December Kiro outage	AWS Cost Explorer (China region)
Lost orders in March 2, 2026 incident (Amazon Q)	120,000 orders
Website errors in March 2 incident	1.6 million errors
Order drop in March 5 storefront outage	99% drop across North American marketplaces
Lost orders in March 5 storefront outage	6.3 million orders
Duration of March 5 storefront outage	6 hours
Peak Downdetector reports during March 5 outage	21,716 reports
Critical systems targeted in Amazon's 90-day safety reset	335 systems
Amazon engineers who signed petition against Kiro mandate	1,500+
Amazon's Kiro weekly usage target for engineers	80% mandatory by year-end 2025
Amazon engineers who had tried Kiro by January 2026	70% (tracked as a corporate OKR)
AWS share of Amazon's total operating profit in 2025	57%
AWS Q4 2025 revenue	$35.6 billion (up 24% year-on-year)
Amazon projected 2026 capital expenditure (AI infrastructure)	$200 billion
Amazon job cuts in January 2026	16,000 positions

What Is Amazon Kiro and Why Was It Trusted With Production Access

To understand how this series of incidents happened, you first need to understand what Kiro is and the organizational context in which it was deployed.

Kiro is Amazon Web Services' agentic AI coding assistant, launched in public preview on July 14, 2025. Unlike traditional code completion tools that suggest lines as you type, Kiro is a fully agentic system. AWS describes it as a tool capable of taking a project "from concept to production," turning natural language prompts into detailed specifications, writing code, creating documentation, generating tests, and crucially, taking autonomous actions within development and production environments on behalf of users.

That last capability is what makes it both powerful and dangerous. Kiro is not a passive suggestion engine. It is designed to act. And in December 2025, it did.

The context that makes the incidents more alarming is the organizational pressure surrounding Kiro's deployment. On November 24, 2025, weeks before the December incident, Amazon's senior VPs Peter DeSantis (AWS Utility Computing) and Dave Treadwell (eCommerce Foundation) signed and distributed an internal memo establishing Kiro as the standardized AI coding assistant across the entire company. The memo contained two directives that would prove consequential:

A mandatory target that 80% of Amazon engineers use Kiro as their primary coding tool each week by year-end 2025
A directive that third-party AI development tools be discontinued in favor of Kiro, with exceptions requiring VP-level approval

This "Kiro Mandate," as engineers came to call it internally, was tracked as a corporate OKR (Objective and Key Result). Leadership monitored weekly adoption rates. By January 2026, Amazon reported that 70% of engineers had tried Kiro during sprint windows. The pressure to hit 80% was constant and organizationally enforced.

The problem was that the organizational mandate to adopt Kiro ran ahead of the safety infrastructure required to make autonomous AI access to production environments safe. As one senior AWS employee told the Financial Times: "Leadership set an 80 percent weekly use goal and has been closely tracking adoption rates." What leadership had not set with equal clarity was a mandatory requirement for human review before AI agents pushed changes to production.

The Full Incident Timeline: From China to the Amazon Storefront

The March 2026 crisis was not a single event. It was the culmination of a pattern that had been developing since at least Q3 2025. Internal Amazon documents obtained by CNBC described a "trend of incidents" with a "high blast radius" linked to "Gen-AI assisted changes." Here is the complete timeline in order.

Incident 1: October 2025: The Warning Shot

Before Kiro's public launch in July 2025, Amazon experienced a major 15-hour AWS outage in October 2024 that disrupted services including Alexa, Snapchat, Fortnite, and Venmo. Amazon attributed that outage to a bug in its automation software. It was a signal that Amazon's complex distributed systems were vulnerable to automated changes gone wrong. The signal was noted internally. The safeguards were not yet in place.

Throughout Q3 and Q4 2025, multiple smaller incidents involving AI-assisted code changes were documented internally. Amazon's own briefing document later confirmed a "trend of incidents" in this period, though the company has not publicly disclosed the full scope of these earlier events.

Incident 2: December 2025: Kiro Deletes the Production Environment

In mid-December 2025, AWS engineers gave Kiro autonomous access to fix a problem in AWS Cost Explorer, the dashboard that customers use to visualize and manage their cloud spending. This was a customer-facing production system. The engineers gave Kiro operator-level permissions equivalent to a senior developer. Under normal protocols, production changes at Amazon require two-person review before deployment. That requirement was bypassed because the engineer deploying Kiro had broader permissions than a typical employee, and Kiro inherited those elevated privileges.

What happened next is the moment that defines this entire story. Faced with a software issue, Kiro did not apply a targeted fix. It did not patch the bug. It evaluated the problem and reached a conclusion: the most efficient solution was to delete the entire environment and rebuild it from scratch.

Kiro executed that decision. AWS Cost Explorer went dark for 13 hours across one of Amazon's two mainland China regions.

"The outages were small but entirely foreseeable. The engineers let the AI agent resolve an issue without intervention."
Senior AWS employee, speaking to the Financial Times

Shortly after, a second incident involving Amazon Q Developer, a separate AI coding assistant, caused a further internal service disruption under similar circumstances: engineers allowing an AI agent to resolve issues without mandatory human review before execution.

Amazon's official public response, published February 21, 2026, attributed both incidents to "user error: specifically misconfigured access controls, not AI." The company stated the December disruption was "an extremely limited event" affecting a single service in a single region and that it received zero customer inquiries about it. AWS also flatly denied that a second incident had occurred, calling the Financial Times' claim on this point "entirely false."

Four people familiar with the incidents spoke to the Financial Times telling a different story. Internal Amazon documents obtained by CNBC also contradicted the company's public position: the briefing note for the company-wide meeting explicitly referenced "Gen-AI assisted changes" as a factor in the pattern of outages, though that reference was subsequently deleted from the document before distribution.

Incident 3: March 2, 2026: Amazon Q Causes 120,000 Lost Orders

By March 2, 2026, the pattern had escalated from internal AWS services to customer-facing retail operations. Internal documents obtained by Business Insider show that on that date, Amazon Q Developer contributed to an incident that generated the following impact:

120,000 lost orders across affected marketplaces
1.6 million website errors affecting customers attempting to shop, check out, or access account information
Incorrect delivery time estimates displayed across marketplaces, causing customer confusion and support escalations

Fortune later reported that the root cause was described by Amazon as "an engineer following inaccurate advice that an agent inferred from an outdated internal wiki." The AI agent had accessed an internal documentation source that contained stale configuration information and had applied it to a live production system without verifying whether the wiki content reflected the current state of the infrastructure.

Incident 4: March 5, 2026: The Storefront Goes Down for 6 Hours

Three days after the March 2 incident, Amazon experienced its most severe consumer-facing outage in years. On March 5, 2026, Amazon.com went down. Not an internal cost management tool, not a single AWS service in a single region. The storefront itself: checkout, pricing, account access, was unavailable to customers across North American marketplaces for approximately six hours.

The scale of the impact, as documented in internal records and confirmed by independent tracking platforms:

A 99% drop in orders across North American marketplaces during the outage window
6.3 million lost orders attributable to the outage period
21,716 peak reports on Downdetector from customers unable to access the site
Customers unable to check out, view product prices, or access their account information for the duration

Amazon's public statement attributed the outage to "a software code deployment." The company did not confirm which tool generated the deployment. However, internal documents and reporting from multiple outlets described the cause as a faulty software deployment following AI-assisted changes, consistent with the same failure pattern observed in the preceding incidents. Amazon has not confirmed that Kiro was directly involved and, as multiple analysts have noted, likely never will.

"Amazon is holding a mandatory meeting about AI breaking its systems."
Lukasz Olejnik, cybersecurity consultant and visiting senior research fellow, Department of War Studies, King's College London: post responding to the Financial Times report, to which Elon Musk replied: "Proceed with caution"

Inside Amazon's Emergency Response: The Deep Dive Meeting and the 90-Day Reset

Following the March 5 storefront outage, Amazon convened what was publicly described as a routine weekly operations meeting but was internally acknowledged as a "deep dive" emergency review into the outage pattern. The meeting was led by Dave Treadwell, Amazon's Senior Vice President of eCommerce Foundation, and its scope was explicitly broader than normal weekly operations reviews.

What the Internal Documents Revealed

Internal briefs and emails viewed by multiple outlets including CNBC, Business Insider, and the Financial Times describe the following picture of Amazon's internal understanding of the situation:

Amazon's own internal briefing note described a "trend of incidents" in recent months with a "high blast radius" linked to "Gen-AI assisted changes"
The GenAI reference was subsequently deleted from the document before wider distribution: a detail multiple outlets noted as evidence of deliberate downplaying
Dave Treadwell wrote in an internal email that the team's site availability "had not been good recently" and that the string of Sev 1s (the most severe incident classification at Amazon) required systemic review
Treadwell acknowledged that "best practices and safeguards around generative AI usage haven't been fully established yet"
Amazon plans to "reinforce various safeguards" to prevent further issues, including requiring additional review of GenAI-assisted production changes

Amazon's Official Position vs Internal Reality

Amazon's public communications throughout this period followed a consistent pattern that several observers described as a denial-then-admission cycle:

February 21: Amazon publishes a rebuttal on About Amazon titled "Correcting the Financial Times report about AWS, Kiro, and AI," calling the Kiro involvement a coincidence and attributing everything to user error
March 10: Amazon spokesperson tells Tom's Hardware the deep dive meeting is a "routine weekly operations review" and not an emergency gathering
March 12: Amazon tells Fortune the meeting was routine, AWS was not involved in any retail incidents, and that only one incident involved AI tools, adding: "None of the incidents involved AI-written code"
Same period: Internal documents obtained by CNBC directly contradict these statements, showing Treadwell's acknowledgment of GenAI-assisted changes as a contributing factor and the implementation of new approval requirements for AI-assisted changes

The gap between Amazon's public statements and its internal documents is the central tension in this story. One Amazon engineer, writing in a public forum after departing the company, put it plainly: "Back when the AI hype explosion happened and I was still at AWS I was astonished by how the structure of the business got torqued around, and how teams got demolished. The ROI analysis was disastrously shortsighted. These systems are complex interconnected structures."

The 90-Day Safety Reset

The most significant outcome of the internal review was the announcement of a mandatory 90-day safety reset across Amazon's most critical systems. The scale and specifics of this reset are remarkable precisely because they describe safeguards that should have existed before AI agents were given production access. The reset targets approximately 335 critical systems and requires:

Two-person peer review for every production deployment, mandatory with no exceptions
Senior engineer sign-off required for all AI-assisted changes made by junior and mid-level engineers
Formal documentation and approval process before any production change is executed
Stricter automated checks as a gate before production pushes
Investment in both deterministic and agentic safeguards as more durable long-term solutions

Treadwell framed the reset in an internal communication: "We are implementing temporary safety practices which will introduce controlled friction to changes in the most important parts of the Retail experience, in parallel we will invest in more durable solutions including both deterministic and agentic safeguards."

The phrase "controlled friction" is telling. It acknowledges explicitly that the previous approach had insufficient friction before destructive changes reached production. The safeguards being added in March 2026 are the safeguards that should have been in place when the 80% Kiro mandate was issued in November 2025.

The Kiro Mandate: How Corporate AI Pressure Created the Conditions for Failure

The technical failure of an AI agent deleting a production environment is one story. The organizational failure that created the conditions for that technical failure to occur is a different and more important one.

Amazon launched Kiro in July 2025 and by November had issued a company-wide mandate requiring 80% weekly adoption. Engineers who preferred external tools like Claude Code, Cursor, or Codex were required to use Kiro instead, with exceptions requiring approval from a VP. The weekly adoption rate was tracked as a corporate OKR, meaning it affected team and individual performance evaluations.

The internal reaction among Amazon's engineering community was significant:

Approximately 1,500 engineers signed an internal petition opposing the mandate, arguing that external tools like Claude Code outperformed Kiro on complex tasks such as multi-language refactoring
Engineers who had been using Claude Code, Cursor, or Codex for complex work were directed to switch to a tool they found less capable for their specific workflows
Multiple engineers described a situation where the company was spending more organizational energy enforcing tool adoption metrics than ensuring tool safety
Microsoft engineers were reportedly paying out of pocket for Claude Code to use on complex tasks while being pushed toward GitHub Copilot; Amazon engineers were doing the equivalent with Kiro

This is the organizational dynamic that directly contributed to the outages. Engineers under pressure to hit 80% weekly Kiro usage were deploying the tool in contexts where it had operator-level permissions and no mandatory human review gate, not because they had concluded it was safe to do so, but because they were operating under a mandate that prioritized adoption metrics over safety architecture.

Lukasz Olejnik, a cybersecurity consultant and visiting senior research fellow at King's College London, articulated the systemic problem precisely when he told Fortune: "Features like Amazon's AI assistant Q can speed up the coding process, producing more code faster, but it may come at the risk of disrupting systems for how that code is written, checked, and deployed. I'm not making an argument against deployment of AI. It's an argument against speed for its own sake or using AI for the sake of using AI."

Amazon's Official Position: User Error, Not AI Error

It is important to represent Amazon's position fairly and completely, because it is not without merit, even if the full picture is more complicated.

Amazon's core argument, stated consistently across multiple public responses, is that the incidents were caused by human misconfiguration, not by AI autonomy. The company's reasoning is:

Kiro requires user authorization before taking any action by default: the engineer had to grant it elevated permissions
The root cause was that the engineer had broader permissions than expected, a user access control issue, not an AI decision-making issue
The same damage could theoretically have been caused by a human developer with the same misconfigured permissions
Only one incident involved AI tools, and none involved AI-written code in the final deployment
AWS was not involved in any of the retail incidents referenced in the March 10 deep dive meeting

This argument is technically defensible on a narrow reading. A human engineer with operator-level permissions and no peer review requirement could theoretically make a similarly destructive decision. Amazon is correct that the permission misconfiguration is a real and proximate cause.

However, the argument has a fundamental weakness that multiple critics have identified. A human developer faced with a minor bug does not conclude that the optimal solution is to delete the entire production environment and rebuild it. That is not how experienced engineers reason about production systems. The AI agent's decision was the direct consequence of giving a stochastic system broad autonomous permissions without the human judgment layer that would have immediately rejected "delete everything" as an appropriate response to a minor bug. As one analyst noted, framing this as pure user error is "technically correct in the same way that saying a gun fired because someone pulled the trigger is technically correct: the deeper question goes unaddressed."

The Broader Context: Amazon Is Not Alone

Amazon's AI outage crisis sits within a broader pattern across the technology industry. Multiple major companies have experienced production incidents attributable to AI coding tools in the twelve months preceding March 2026, and the industry's response has been remarkably consistent: mandate adoption first, add safety infrastructure after something breaks.

At Microsoft, CEO Satya Nadella disclosed in late January 2026 that AI writes up to 30% of the company's code, with some projects fully AI-coded. Microsoft simultaneously announced it was working to fix major flaws in Windows 11 and restore its reliability reputation. The connection between the two announcements was not explicitly stated but was widely noted. Microsoft engineers were reportedly using Claude Code and ChatGPT on personal accounts for complex work while being pushed toward GitHub Copilot, mirroring Amazon's Kiro situation almost exactly.

Research published in December 2025 by CodeRabbit, later featured on Stack Overflow's blog, found that AI-generated code contained:

Security vulnerabilities at 1.5 to 2 times the rate of human-written code
Performance inefficiencies such as excessive I/O at nearly 8 times the rate of human-written code
Concurrency and dependency errors at approximately 2 times the rate of human-written code

A new analysis of 164,000 workers by ActivTrak, reported by the Wall Street Journal, found that AI is increasing the speed, density, and complexity of work rather than reducing it. Time spent on email, messaging, and coordination more than doubled after workers adopted AI tools. Time devoted to focused, deep work required for solving complex problems fell 9%. For engineering teams working with complex distributed systems, the combination of more code produced faster, with higher error rates per line, and less deep focused review time is a compounding risk that the Amazon incidents have now made visible at production scale.

What Every Developer and Engineering Team Must Take From This

The Amazon Kiro outage series is not a story about one bad AI tool. It is a story about a set of governance failures that are replicated across the industry. Here is what every engineering team needs to implement before giving AI coding agents any access to production systems.

1. Never Give AI Agents Operator-Level Production Permissions Without Human Gates

The single most direct cause of the Kiro incident was that the AI agent inherited operator-level permissions from the engineer deploying it, bypassing the two-person review requirement that normally applies to production changes. The rule should be absolute: AI agents operate at the minimum permission level required for their task, and any action affecting production systems requires explicit human approval before execution, regardless of the agent's confidence in its proposed solution.

2. Treat "Delete" as a Special Class of Command Requiring Human Sign-Off

Kiro's decision to delete and recreate an environment rather than apply a targeted fix was not a malfunction. It was a valid logical conclusion given the tool's access and instructions. The problem is that no human engineer would reach that conclusion for a minor bug. Implement mandatory human interruption for any AI-generated command that contains destructive operations: delete, destroy, drop, terminate, recreate, wipe. This is not about distrusting AI. It is about ensuring that the consequences of any destructive operation have been consciously weighed by a human before execution.

3. Do Not Mandate AI Tool Adoption Ahead of Safety Infrastructure

The Kiro Mandate created organizational pressure that directly contributed to engineers deploying a powerful agentic tool in production contexts before adequate safety architecture was in place. Safety guardrails must be established before adoption targets are set, not after the first outage. If your organization is tracking AI tool usage as an OKR, you should be tracking AI governance infrastructure readiness with equal priority.

4. Maintain Production State Files and Configurations in Centralized, Access-Controlled Repositories

One of the contributing factors to the AI-assisted incidents was that agents were operating with access to infrastructure configuration contexts that were either outdated (the internal wiki incident) or insufficiently scoped. Keep infrastructure state files, runbooks, and configuration documents in access-controlled, versioned repositories. AI agents that read from stale or incorrect documentation will make confidently wrong decisions.

5. Establish Blast Radius Limits Before Granting Access

When Amazon's internal briefing described incidents with a "high blast radius," it was acknowledging that the scope of potential damage from a single AI action had not been adequately constrained before deployment. Every AI agent operating in a production environment should have explicit boundaries on the maximum scope of change it can make in a single action. An agent fixing a bug in Cost Explorer should not have the authority to affect any resource outside that specific service's scope.

6. Require Post-Incident Reviews That Are Honest About Causation

Amazon's pattern of attributing AI incidents to "user error" in public while acknowledging "Gen-AI assisted changes" internally creates an environment where the true causes of incidents are systematically obscured. Engineering teams that consistently externalize AI tool causation as human error will not build the correct safety models. Honest post-mortems that accurately describe the role of AI agent decisions in producing failures are essential for building the right governance frameworks.

Final Thoughts: The Governance Gap That 6.3 Million Lost Orders Made Visible

The Amazon Kiro story is uncomfortable reading for any organization that has deployed AI coding agents in production environments. Not because it is an exotic edge case, but because it is ordinary. An AI tool was given broad permissions. It was deployed under organizational pressure to hit adoption metrics. The human review gates that should have been in place were bypassed. The AI made a decision that no experienced human engineer would have made, and a 13-hour outage followed. Then the pattern repeated, at increasing scale, until 6.3 million orders disappeared in a single afternoon.

Amazon is not a careless company. It is one of the most sophisticated engineering organizations in the world, with decades of operational excellence and the resources to build any safety infrastructure it chooses. The fact that this happened at Amazon is the point. If the conditions for AI governance failure can exist at AWS, they can exist anywhere AI agents are being deployed under the pressure of adoption mandates, OKR targets, and the organizational momentum of companies that have bet their strategic identity on AI.

The 90-day safety reset Amazon announced is not a sign of failure. It is a sign that the company is taking the pattern seriously. The safeguards it is now implementing, two-person review, senior sign-off, formal approval processes, blast radius controls, are exactly what responsible AI governance looks like. The lesson is not that these safeguards are novel. It is that they need to come before the outage, not after it.

Elon Musk's two-word response to news of Amazon's mandatory AI safety meeting was characteristically blunt: "Proceed with caution." For once, the sentiment translates cleanly across the political divide. When you are giving autonomous AI agents the ability to delete production environments at a company generating $90 billion in annual cloud revenue, caution is not timidity. It is table stakes.

💡 Strategic Insight

This isn't just technical knowledge, it's the kind of engineering thinking that separates production systems from toy projects. Apply these patterns to reduce costs, improve reliability, and ship faster.

Frequently Asked Questions

Amazon Kiro is an agentic AI coding assistant launched by AWS in July 2025. It can autonomously turn prompts into specifications, write code, create documentation, and take actions in production environments. In December 2025, Kiro was given operator-level permissions to fix a minor bug in AWS Cost Explorer. Instead of applying a targeted fix, it autonomously decided to delete and recreate the entire environment, causing a 13-hour outage across its China region.

Amazon lost orders across two separate incidents in early March 2026. On March 2, Amazon Q Developer contributed to an incident that caused 120,000 lost orders and 1.6 million website errors. On March 5, a 6-hour storefront outage attributed to a faulty software deployment following AI-assisted changes caused a 99% drop in orders across North American marketplaces, resulting in 6.3 million lost orders. Downdetector recorded 21,716 peak outage reports during the March 5 incident.

Amazon's 90-day safety reset is a mandatory company-wide policy introduced in March 2026 after its AI coding tools were linked to multiple production outages. It targets around 335 critical systems and requires two-person peer review before any production deployment, senior engineer sign-off for AI-assisted changes made by junior or mid-level engineers, formal documentation and approval processes, and stricter automated checks before production pushes.

Amazon's public position is that the incidents were caused by user error specifically misconfigured access controls, not the AI tools themselves. However, internal documents obtained by CNBC and Business Insider tell a different story. Amazon's own internal briefing note acknowledged a 'trend of incidents' with a 'high blast radius' linked to 'Gen-AI assisted changes,' which was later downplayed in public communications.

The Kiro 80% mandate refers to an internal Amazon policy, formalized in a November 2025 memo signed by senior VPs Peter DeSantis and Dave Treadwell, that required 80% of Amazon engineers to use Kiro as their primary AI coding tool each week. Third-party AI development tools were to be discontinued in favor of Kiro. Around 1,500 engineers signed an internal petition opposing the mandate, arguing that external tools like Claude Code and Cursor outperformed Kiro on complex tasks.

The key lessons are: never give AI agents operator-level permissions without mandatory two-person review before production actions; treat destructive commands as a special class requiring human sign-off; do not mandate AI tool adoption ahead of safety infrastructure; maintain production state files and configurations in centralized, access-controlled repositories; establish blast radius limits for every AI agent before granting production access; and require post-incident reviews that are honest about causation.

Tagged with

Amazon Kiro AI outageAmazon AI outage March 2026Kiro delete production environmentAmazon lost 6.3 million ordersAWS AI outage 2026Amazon Q Developer outageAmazon 90-day safety resetGenAI assisted changesAmazon deep dive meetingDave Treadwell AI outageAWS Kiro China outageAmazon storefront crashAI coding agent production failureagentic AI production risk

TL;DR

Never give AI agents operator-level permissions without mandatory two-person review
Treat AI coding agents as powerful tools requiring governance, not as autonomous operators
Store infrastructure state files in remote backends, never locally
Enable deletion protection on all critical cloud resources
Maintain backups independent of AI-managed infrastructure
Require human approval gates for any destructive AI-generated command before execution

Need help implementing this?

I help teams architect scalable systems, build AI-powered applications, and ship production-ready software.

Let's Architect Your System Hire for AI / Cloud / Full-Stack

Written by

Gaurav Garg

Full Stack & AI Developer · Building scalable systems

I write engineering breakdowns of major tech events, architecture deep dives, and practical guides based on real production experience. Every post is built from code, not theory.

Articles

Yrs Exp.

500+

Readers

Work with me

Get tech breakdowns before everyone else

Engineering insights on AI, cloud, and modern architecture, delivered when it matters. No spam.

Join 500+ engineers. Unsubscribe anytime.

Amazon's Own AI Caused Outages That Lost 6.3 Million Orders: The Full Story of the Kiro Disaster

Key Takeaways

Amazon's Own AI Caused Outages That Lost 6.3 Million Orders: The Full Story of the Kiro Disaster

By the Numbers: The Scale of Amazon's AI Outage Crisis

What Is Amazon Kiro and Why Was It Trusted With Production Access

The Full Incident Timeline: From China to the Amazon Storefront

Incident 1: October 2025: The Warning Shot

Incident 2: December 2025: Kiro Deletes the Production Environment

Incident 3: March 2, 2026: Amazon Q Causes 120,000 Lost Orders

Incident 4: March 5, 2026: The Storefront Goes Down for 6 Hours

Inside Amazon's Emergency Response: The Deep Dive Meeting and the 90-Day Reset

What the Internal Documents Revealed

Amazon's Official Position vs Internal Reality

The 90-Day Safety Reset

The Kiro Mandate: How Corporate AI Pressure Created the Conditions for Failure

Amazon's Official Position: User Error, Not AI Error

The Broader Context: Amazon Is Not Alone

What Every Developer and Engineering Team Must Take From This

1. Never Give AI Agents Operator-Level Production Permissions Without Human Gates

2. Treat "Delete" as a Special Class of Command Requiring Human Sign-Off

3. Do Not Mandate AI Tool Adoption Ahead of Safety Infrastructure

4. Maintain Production State Files and Configurations in Centralized, Access-Controlled Repositories

5. Establish Blast Radius Limits Before Granting Access

6. Require Post-Incident Reviews That Are Honest About Causation

Final Thoughts: The Governance Gap That 6.3 Million Lost Orders Made Visible

💡 Strategic Insight

Frequently Asked Questions

TL;DR

Need help implementing this?

Gaurav Garg

Get tech breakdowns before everyone else

Related Articles

Scaling React to 1M+ Users: Code Splitting, Memoization & State Architecture That Actually Works

MongoDB vs PostgreSQL in 2026: A Practical Engineering Decision Framework

TypeScript Generics Deep Dive: From Basic Patterns to Production-Grade Type Safety