Essential Complexity

Modernizing high-risk systems in the age of AI.

Detection Won't Save You

Detection finds problems. Containment decides whether they become incidents.

Joe Leo

Founder, Def Method

For the past year, the debate about AI-generated code and security has been conducted almost entirely in the language of probability. How likely is AI-generated code to contain a vulnerability? How much more likely than code written by a person? These numbers have been useful, and they have been largely ignored, because probability is abstract.

That changed in May 2025, when a research team at Georgia Tech's Systems Software and Security Lab launched a project called the Vibe Security Radar. It's built on a simple premise: stop estimating, start counting.

The early numbers were small enough to dismiss. Two confirmed cases in August 2025. Six in January 2026. Fifteen in February. Then March arrived. Thirty-five vulnerabilities directly attributable to AI-generated code, disclosed in a single month. The total confirmed count hit 74. The lead researcher, Hanqing Zhao, told Infosecurity Magazine that the real number is almost certainly five to ten times higher because most AI coding tools leave no detectable trace.

The count itself is not the most important number. The trajectory is.

What the Data Is Actually Telling Us

The explanation for why this keeps happening is structural: security pass rates for AI-generated code have stayed flat at 55% while every other performance measure has improved. AI coding tools are trained on public repositories that contain both secure and insecure patterns, and they reproduce both with equal confidence. The training objective is functional correctness: does the code compile, does it pass the tests, does it do the thing the user asked? Security is not part of that feedback loop because from the model's perspective, there is nothing to fail.

The result is that the industry has optimized for the wrong signal. Models that are better at writing code are also better at writing plausible, functional, quietly insecure code — faster, at greater volume, with more confidence than a junior developer who might at least pause and wonder if they're missing something.

Apiiro measured what this looks like inside real organizations. Across Fortune 50 enterprises in the first half of 2025, AI-assisted developers committed code at three to four times the rate of their peers. Monthly security findings rose from roughly 1,000 to more than 10,000 over the same period. The velocity was real. So was everything that came with it.

The Wrong Mental Model

The industry's instinctive response to all of this is to improve detection. Better scanners. Faster review. More automation pointed at the same places we have always looked. The assumption underneath that response is that the problem is findability: if we could just surface vulnerabilities faster, we could stay ahead of them.

That assumption made sense when AI was a minor contributor to a codebase. It does not make sense when AI is generating 42% of all code, a share that surveys suggest will cross 50% by next year. At that point, detection stops being a control mechanism and becomes a reporting function: it tells you how much vulnerability your systems are carrying, not how much they can carry before something breaks.

This is a different kind of problem. Detection is about catching individual failures before they reach production. Containment is about the architecture of your systems and whether they are built in a way that limits how far a failure can travel.

The distinction matters because AI introduces a specific and predictable class of problems. Bad parsing. Edge case bugs. Incorrect assumptions about inputs that only manifest under unusual conditions. There is nothing novel about these failure modes. They are the ordinary mistakes of a very fast, very confident, inexperienced contributor, one that does not know what it does not know, and does not slow down when it is uncertain.
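To make that concrete, here is a minimal Ruby sketch, since the audience here runs Rails. The handler and its inputs are hypothetical, not drawn from any real codebase; it is the kind of code that contributor produces: functional, passing the happy-path test, and quietly wrong at the edges.

    # Hypothetical handler of the kind an AI assistant tends to produce:
    # it runs, it passes the obvious test, and it mishandles the edges.
    def amount_in_cents(params)
      # Assumes params["amount"] is always a clean decimal string.
      # nil or "" raises at runtime, and "0.29" silently becomes 28,
      # because (0.29 * 100).to_i truncates the inexact float.
      (Float(params["amount"]) * 100).to_i
    end

    amount_in_cents({ "amount" => "12.50" })  # => 1250, looks right
    amount_in_cents({ "amount" => "0.29" })   # => 28, one cent short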

The question is not whether those mistakes will occur. At current adoption rates, they will occur constantly. The question is what happens when they do. Review scales linearly. Code generation does not. That gap only widens.

Building for Containment

In a well-architected system, an AI-introduced bug in one service does not cascade into an outage across five. A parsing error at an API boundary throws a handled exception rather than corrupting a database. A bad assumption about user input triggers an alert before it becomes a breach. The failure is localized, observable, and recoverable.
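As a sketch of what that boundary can look like in a Rails codebase: the controller and service names below are hypothetical, and the only claim is the shape of the pattern. Parse strictly at the edge, fail loudly, keep the failure local.

    # A containment sketch, assuming a Rails JSON API.
    # PaymentsController and Payments::Recorder are hypothetical names.
    class PaymentsController < ApplicationController
      # Malformed input becomes a handled, local failure: a 422 at the
      # boundary rather than a corrupt row three services downstream.
      rescue_from ActionController::ParameterMissing, ArgumentError do |error|
        render json: { error: error.message }, status: :unprocessable_entity
      end

      def create
        # Integer() raises ArgumentError on "12.5", "abc", or "",
        # instead of silently coercing the way to_i would.
        amount = Integer(params.require(:amount_cents), 10)
        Payments::Recorder.call(amount_cents: amount)
        head :created
      end
    end

Nothing in that sketch is exotic. The point is that the failure mode is chosen at the boundary, by design, before the data travels anywhere.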

That is not a description of most systems today. Most production systems were designed around the assumption that the code entering them had been read, understood, and judged reasonable by a person. That assumption is now false at the system level, even if it remains true for individual commits.

That gap, between the assumption and the reality, is where risk accumulates. Not in a single dramatic failure, but in a steady, quiet buildup of assumptions that were never validated, edge cases that were never tested, and boundaries that were never enforced because the review process was optimized for throughput, not containment.

Moving toward containment does not require rebuilding everything at once. It requires asking a different question during every architectural decision, every sprint review, every postmortem: not just "did we catch this?" but "if we hadn't caught it, how far would it have traveled?" Services that answer that question badly are the ones that need hardened boundaries first: explicit input validation, tighter failure modes, alerting that fires before a problem becomes visible to users. The goal is not a system where AI-introduced bugs never occur. It is a system where they occur with minimal consequence.
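Here is a sketch of the first and last of those, under the same caveats: Boundary and Alerting.notify are hypothetical stand-ins for whatever validation and paging layers a team already runs.

    # Hardened-boundary sketch. Boundary and Alerting.notify are
    # hypothetical stand-ins for a team's own validation and paging code.
    module Boundary
      InvalidInput = Class.new(StandardError)

      # Explicit validation at the edge: record the anomaly, then raise,
      # so it is observable before it becomes visible to users.
      def self.validate!(value, pattern:, context:)
        return value if value.is_a?(String) && value.match?(pattern)

        Alerting.notify("boundary.invalid_input", context: context)
        raise InvalidInput, "rejected #{context}: #{value.inspect}"
      end
    end

    # At a service boundary, the bad value stops here, inside the
    # service that owns it, instead of traveling across the system.
    account = Boundary.validate!(params[:account_number],
                                 pattern: /\A\d{6,12}\z/,
                                 context: "account_number")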

What Changes When the Numbers Are Real

The Vibe Security Radar matters not because of what it has found so far, but because of what it has made possible: for the first time, AI-introduced vulnerabilities are on the record — in the same databases that insurers, regulators, and legal teams already use to assess risk.

That is a different kind of evidence. A severity score and a list of affected systems lands differently in a board conversation than a benchmark pass rate — because it describes something that already happened to someone, not something that might happen to you.

The teams that read these numbers as a detection problem will respond by improving their scanners. They will find more vulnerabilities, faster. For a while, that will feel like progress.

The teams that read them as a containment problem will ask a different set of questions. Not just "are we catching enough?" but "when something slips through, does our architecture limit the damage?" Are failures localized to a service boundary or does the blast radius extend across the system? Are anomalies observable before they become incidents? Is recovery measured in minutes or days?

These are not tooling decisions. They are decisions about how much unverified behavior your organization is willing to run in production.

The CVE clock is not just ticking. It is compounding. And compounding problems do not wait for the organization to catch up. They force a correction on their own schedule.

Detection finds problems. Containment decides whether they become incidents.

Ready to modernize your Rails system?

We help teams modernize high-stakes Rails applications without disrupting their business.

If this was useful, you might enjoy Essential Complexity — a bi-weekly letter on modernizing high-risk systems in the age of AI.