Essential Complexity

Modernizing high-risk systems in the age of AI.

The Reliability Gap

Amazon's 90-day reset is not an engineering strategy. It is an admission.

Joe Leo

Founder, Def Method

Last week, Amazon ordered a 90-day code safety reset across its most critical engineering systems.

The details are worth sitting with. On March 2nd, a failure caused nearly 120,000 cancelled orders and 1.6 million errors across Amazon's websites. Three days later, a second outage caused a 99 percent decline in orders in North American marketplaces. An internal memo from Amazon's SVP of e-commerce services identified a pattern of problems tied to GenAI-assisted changes dating back to Q3 2025.

Amazon has a gap between how fast AI can generate change and how fast its systems can safely absorb it.

The company mandated that 80 percent of its engineers use its AI coding assistant weekly. Development velocity increased. Reliability infrastructure didn't. The result was a pattern of high blast radius incidents that took down ordering systems twice in three days. Amazon is now requiring two reviewers to approve any major code change across 335 Tier-1 systems, and is introducing what it calls "controlled friction" to slow down risky changes in core retail systems.

The company with nearly unlimited engineering resources, facing a problem it created with its own tools, reached for a slowdown.

Controlled Friction Is Not a Strategy

Controlled friction is not an engineering strategy. It is an admission that the infrastructure to safely absorb AI-generated change doesn't exist yet, even at Amazon.

In The Bull Case for Ambition, I argued that abundant intelligence doesn't eliminate the constraints of software product development. It exacerbates them. Faster code generation without deeper test coverage increases blast radius. Automated workflows layered onto unclear compliance logic increase regulatory exposure. Intelligence improves locally while systemic risk accumulates globally.

Amazon is that argument, at scale, in production.

Amazon's own internal language for what happened is telling. Executives described the incidents as "high blast radius changes" in which updates were disseminated across systems without adequate protection. The vocabulary of reliability engineering was already there. The infrastructure to enforce it was not.

The Shape of the Gap

The reliability gap moves the productivity bottleneck from writing software to trusting it. Most organizations have not yet registered the shift.

The gap has a specific shape. Testing is still largely human-paced. Observability tooling was designed for systems where humans made the changes. Failure isolation was architected around development cycles measured in weeks, not minutes. None of that has changed as fast as the velocity of AI-generated code.

Closing the Gap

Closing the gap looks different from adding reviewers. Test coverage must evolve as fast as the code it covers. Observability must track not just what the system did, but how confident the agent was when it decided. Deployment pipelines must route changes by risk profile automatically, rather than relying on human reviewers to catch what automated systems missed. Reliability must be treated as an architectural property of the system, not a process layer applied on top of it.
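To make "route changes by risk profile" concrete, here is a minimal sketch in Python. It is an illustration, not a description of Amazon's pipeline or anyone else's: the `Change` fields, the weights, and the thresholds are all hypothetical, and a real system would calibrate them against its own incident history.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_DEPLOY = "auto_deploy"    # ship without a human in the loop
    CANARY = "canary"              # ship to a small slice of traffic first
    HUMAN_REVIEW = "human_review"  # hold for reviewer approval


@dataclass(frozen=True)
class Change:
    # All fields are hypothetical signals a pipeline might collect.
    touches_tier1: bool      # change touches a business-critical system
    lines_changed: int       # size of the diff
    covered_fraction: float  # share of changed lines exercised by tests, 0.0 to 1.0
    ai_assisted: bool        # change was produced with an AI coding assistant


def risk_score(change: Change) -> float:
    """Combine the signals into a score in [0, 1]. Weights are illustrative."""
    size = min(change.lines_changed / 500, 1.0)
    untested = 1.0 - change.covered_fraction
    score = 0.4 * untested + 0.3 * size
    if change.touches_tier1:
        score += 0.2
    if change.ai_assisted:
        score += 0.1
    return min(score, 1.0)


def route(change: Change) -> Route:
    """Pick a deployment path from the risk score. Thresholds are illustrative."""
    score = risk_score(change)
    if score < 0.25:
        return Route.AUTO_DEPLOY
    if score < 0.6:
        return Route.CANARY
    return Route.HUMAN_REVIEW


if __name__ == "__main__":
    small_tested = Change(touches_tier1=False, lines_changed=20,
                          covered_fraction=0.95, ai_assisted=True)
    large_untested = Change(touches_tier1=True, lines_changed=800,
                            covered_fraction=0.10, ai_assisted=True)
    print(route(small_tested).value)
    print(route(large_untested).value)
```

The point of the sketch is the shape, not the numbers. Human review becomes one route among several, reserved for the changes whose risk profile warrants it, instead of a gate applied uniformly to everything.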

James Gosling, the lead designer of Java and a former distinguished engineer at AWS, put it plainly after a different major AWS outage last year: "These systems are complex interconnected structures. Unless the whole ecosystem is comprehended in total, bad decisions are made."

That comprehension is what reliability engineering provides. And it has not kept pace with the tools generating the complexity.

The Question Worth Watching

The companies that recognize this early will not slow down. They will build the infrastructure that makes speed safe. The ones that don't will keep reaching for controlled friction, adding reviewers and approval gates, and discovering that human bottlenecks do not scale any better the second time around.

Amazon's 90-day reset buys time. What it builds in those 90 days is the question worth watching.

The reliability gap is not a reason to slow down AI adoption. It is the next problem worth being ambitious about.
