How We Hit 83.4% on SWE-bench Verified (Part 2): Finding the Root Cause and Generating the Fix
We recently tested an AI debugging methodology on SWE-bench Verified and achieved a combined pass rate of 83.4%. Our overview post covers the full methodology, results, and high-level thinking — if you haven't read it yet, that's a good place to start.
The methodology breaks down into three stages: reproduce the bug → generate a fix → verify the fix is trustworthy. This series walks through each stage and explains how runtime facts guide the AI toward the right answer at every step.
Part 1 covered the Reproduce stage: before touching any code, the agent runs the program to collect real call chains and argument data — runtime facts — so it's working from evidence instead of guesswork.
This post answers one question: once you have those runtime facts, how do you make sure the agent changes the right code?
A lot of AI agents don't fail because they can't write a patch. They fail because they write the patch too early. The agent sees where the error is thrown, immediately adds a defensive check, or makes a local fix around the reproduction script, and declares victory. It looks fixed — until a different trigger path surfaces the same bug. This is the classic wrong fix.
The goal of the Generate Fix stage is to ensure the agent only modifies code when the evidence is solid and the direction is correct.
A Quick Recap: What Are Runtime Facts?
If you skipped Part 1, here's the short version.
Runtime facts are all the observable data produced while a program is running: debug traces, logs, object state snapshots, and exception information. A debug trace records an entire execution run — which functions were called and in what order, what arguments each function received, what each function returned, and where exceptions were thrown or caught. We collect this automatically using a modified OpenTelemetry probe.
With runtime facts, every judgment the agent makes can point to a specific piece of evidence. This is the foundation the whole system is built on.
Prove It First, Then Fix It: Code Changes Require a Hypothesis Card
In our system, the agent doesn't get permission to change code by default. Before it can touch any code, it must complete a hypothesis card based on the runtime facts it has collected.
The hypothesis card requires the agent to nail down three things:
- What is the root cause? (supported hypothesis)
- Which specific evidence in the trace supports this conclusion?
- What other explanations looked plausible but were ruled out by the evidence? (rejected hypothesis)
That third item is the most important one. If you only ask an agent to explain why it believes something, it's easy for it to rationalize its way to a conclusion. "This function received None, so I'll add a null check here" — that logic sounds coherent, but it skips the more important question: why was None being passed in the first place?
Requiring a rejected hypothesis forces the agent to actively argue against its own conclusion. It's not enough to say "here's why I believe this." It also has to say "here's what I considered and ruled out, and why."
Here's a concrete example. Say the issue is "calling translate_url() returns the wrong result":
```
# A passing hypothesis card looks like this:

supported hypothesis:
  Root cause is normalize() returning None when handling URLs with a prefix,
  instead of returning the processed path string.
  Evidence: trace shows normalize() return value is None,
  and downstream reverse() uses this return value directly.

rejected hypothesis:
  The bug is not in translate_url()'s own logic.
  Evidence: trace shows translate_url() receives correct input arguments;
  the problem occurs after it calls normalize().
```
Only when the hypothesis card meets the quality bar does the system allow code changes. If the evidence is thin, the system tells the agent exactly what's missing — for example, "no data on upstream callers of normalize(); run `callers normalize` first" — rather than letting it proceed on gut feeling.
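A minimal sketch of that gate might look like the following. The `HypothesisCard` structure and the `missing_fields` check are illustrative assumptions, not our production quality bar (which also verifies that evidence actually cites trace data):

```python
from dataclasses import dataclass

@dataclass
class HypothesisCard:
    supported: str           # root-cause claim
    supported_evidence: str  # trace facts backing the claim
    rejected: str            # alternative explanation considered
    rejected_evidence: str   # trace facts that ruled it out

def missing_fields(card: HypothesisCard) -> list[str]:
    """Return which sections of the card are still empty.

    This sketch only enforces that every section is filled in; a real
    quality bar would also validate the evidence against the trace.
    """
    return [name for name, value in vars(card).items() if not value.strip()]

card = HypothesisCard(
    supported="normalize() returns None for prefixed URLs",
    supported_evidence="trace: normalize() -> None; reverse() consumes it",
    rejected="bug is in translate_url() itself",
    rejected_evidence="",  # still missing -> code changes stay blocked
)
```

The useful behavior is the failure message: instead of a yes/no verdict, the agent learns exactly which section of the card still lacks evidence.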
Bug Type Determines Fix Strategy: wrong-arg Must Be Traced to the Source
Even after the agent has the green light to make changes, it can't just pick wherever it wants to fix things. We first do bug class routing, classifying the problem into one of three categories:
- wrong-arg: A function received a bad argument, but the bad value was produced upstream
- missing-handler: A certain type of input isn't handled and logic needs to be added
- logic: The function's own processing logic is incorrect
Each category calls for a different fix strategy. The one most prone to wrong fixes is wrong-arg.
A wrong-arg bug usually looks like a problem at the crash site: some function received a value it shouldn't have. But the crash site is typically just the consumer — it used the bad value, but didn't produce it. The function that produced the bad value is somewhere upstream.
Here's an example: execute_sql() receives None. But None was passed in by QuerySet._insert(), which got it from save_base() returning the wrong result in the first place. If the agent adds if value is None: return inside execute_sql(), the error disappears on the surface — but the root cause is untouched, and the same bug will resurface through a different code path.
So for any wrong-arg bug, we require the agent to run a 5-Whys analysis before it's allowed to write a single line of fix code.
5-Whys is a classic root cause analysis technique: start from the visible symptom, keep asking "why," and treat each answer as the starting point for the next question — until you reach the actual source of the problem. In a debugging context, it forces the agent to keep tracing upstream instead of stopping at the first explanation that sounds reasonable.
Using the same execute_sql() example, here's what that chain of reasoning looks like:
Why does execute_sql() throw an error?
→ Because the `values` argument it receives is None, but it expects a list.
Why is `values` None?
→ Because QuerySet._insert() passed None into it.
Why did _insert() pass None?
→ Because the return value it got from save_base() was None.
Why did save_base() return None?
→ Because one of its internal branches calls _do_insert() but has no return
statement, so Python implicitly returns None.
Why does that branch have no return statement?
→ It was added recently, the return was forgotten, and there's no test
covering this code path.
By the fifth level, the actual root cause surfaces: a missing return statement in a branch inside save_base(), not anything wrong with execute_sql(). If the agent had stopped at the first level and added a null check, the bug would have been suppressed, not fixed.
All five levels must be resolved before the agent is allowed to start writing code. This is the core mechanism that prevents "add a null check at the crash site and call it done."
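The missing-return pattern at the bottom of that chain is easy to see in miniature. The following is a toy model, not Django's real code — `save_base_buggy`, `_values_for_insert`, and the simplified `execute_sql` are invented stand-ins that reproduce the shape of the bug:

```python
def save_base_buggy(obj):
    """Toy model: the branch for new objects forgets its return statement."""
    if "pk" not in obj:
        _values_for_insert(obj)   # computed, then silently dropped
    else:
        return [obj["pk"]]

def save_base_fixed(obj):
    """Root-cause fix: every branch returns the values list."""
    if "pk" not in obj:
        return _values_for_insert(obj)
    return [obj["pk"]]

def _values_for_insert(obj):
    return list(obj.values())

def execute_sql(values):
    """Crash site: it consumes the bad value, but did not produce it."""
    if values is None:
        raise TypeError("expected a list of values, got None")
    return f"INSERT ... {values}"

# On the buggy path, Python implicitly returns None from the branch with
# no return statement, and the error only surfaces two frames downstream:
try:
    execute_sql(save_base_buggy({"name": "x"}))
    crashed = False
except TypeError:
    crashed = True
```

A null check inside `execute_sql` would make the crash disappear while leaving `save_base_buggy` silently dropping its result; the one-line `return` in `save_base_fixed` is the fix the 5-Whys chain points at.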
Preventing Fix Drift: Every Change Must Stay in the Focus Zone
Another common failure mode is fix drift: the agent analyzes in the right direction, then writes a patch that modifies unrelated code.
For example, the issue is about QuerySet.create() failing to save data. The debug trace points to SQLInsertCompiler.execute_sql() as the root cause. But the agent ends up modifying a helper function in test_utils.py, or touching a general utility that has nothing to do with this call chain. The tests might still pass — but the change has no causal relationship to the actual issue.
We handle this in the Implement phase with focus alignment. The issue description, the reproduction path, and the key functions in the trace together define a "focus zone." The first code change must land inside that zone. If a patch falls outside it, the system blocks the change and requires the agent to explain the causal link between its modification and the root cause it identified.
This doesn't ban changes to surrounding code entirely. It just requires the agent to justify why those changes are necessary before proceeding.
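As a rough illustration of the first-pass check (the set-intersection rule and names below are our simplification, not the exact production logic):

```python
def in_focus_zone(patched_functions: set[str], focus_zone: set[str]) -> bool:
    """First code change must touch at least one function in the focus zone.

    focus_zone is the union of functions named in the issue, the
    reproduction path, and the key frames of the debug trace.
    """
    return bool(patched_functions & focus_zone)

# Focus zone for the QuerySet.create() example:
focus = {
    "QuerySet.create",
    "QuerySet._insert",
    "SQLInsertCompiler.execute_sql",
}

# A patch that only touches a test helper gets blocked pending justification;
# a patch landing on the traced call chain passes.
drifted = in_focus_zone({"test_utils.make_fixture"}, focus)
aligned = in_focus_zone({"SQLInsertCompiler.execute_sql"}, focus)
```

Blocking here isn't a hard ban — it converts an out-of-zone change into a prompt for the agent to state the causal link, which either exists (and the change proceeds) or doesn't (and the drift is caught early).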
Preventing the Agent from Spinning Its Wheels
There's another subtle failure mode: the direction is right, but the agent keeps doing unproductive things. It queries the same trace data repeatedly, re-reads the same files, or keeps creating new probe scripts even though it already has enough information to act.
This is like an engineer who keeps flipping through the same documentation but never makes a decision. Time and context window both get consumed, and actual progress stalls.
We address this with anti-loop guards. For redundant trace queries, repeated file reads, and unnecessary probe scripts, the system blocks the action and returns a suggested next step instead. This keeps the agent from falling into a pattern of "collecting more information" while making no real decisions.
What the Generate Fix Stage Actually Produces: An Explainable Patch
Taken together, these controls do one thing: they turn code writing from a freestyle activity into a structured, evidence-based decision process.
- The hypothesis card requires both supporting and refuting evidence, preventing the agent from arguing itself into a bad conclusion
- wrong-arg bugs must be traced to the upstream source, preventing surface-level patches at the crash site
- Every change must pass focus alignment, preventing fix drift
- Actions are checked against anti-loop guards, preventing unproductive cycles
The goal of this stage is for every code change to be traceable back to specific evidence in the runtime facts.
After the patch is generated, there's still one more trap that's easy to miss: passing tests doesn't mean the bug is actually fixed. That's what the next post covers — how the Validate stage uses runtime facts to determine whether a patch is a real fix or just a convincing fake, and how failures get turned into useful input for the next iteration.