How We Hit 83.4% on SWE-bench Verified (Part 1): Getting Reproduction Right
We recently tested an AI debugging methodology on SWE-bench Verified and achieved a combined pass rate of 83.4%. Our overview post covers the full methodology, results, and high-level thinking — if you haven't read it yet, that's a good place to start.
The methodology breaks down into three stages: reproduce the bug → generate a fix → verify the fix is trustworthy. This series walks through each stage and explains how runtime facts guide the AI toward the right answer at every step.
This first post covers Stage 1: How do you make sure a bug reproduction is actually correct before you touch any code? The short answer: before writing a single line of fix code, we have the agent collect a set of trusted runtime facts and use them to verify the reproduction actually matches the issue description.
What Are Runtime Facts?
Runtime facts are all the observable data produced while a program is running — debug traces, logs, object state snapshots, and exception information. We instrument code automatically using a modified OpenTelemetry probe to collect this data. The most important piece is the debug trace, so let's take a moment to understand what that means.
A debug trace records an entire execution run: which functions were called and in what order, what arguments each function received, what each function returned, and where exceptions were thrown or caught.
You've probably seen a stack trace before — that wall of error text that appears when a program crashes. A stack trace is a small slice of a debug trace. It only records the path an exception took as it bubbled up through the call stack, answering "how did this error surface?" A debug trace captures the full picture: starting from the entry point, every step along the entire call chain, not just the part that went wrong.
Here's a concrete example. Say a user reports "calling create() to save data throws an error":
# Stack trace — tells you how the error surfaced:
File "db/models/query.py", line 1802, in _insert
    return compiler.execute_sql(...)
File "db/models/sql/compiler.py", line 1553, in execute_sql
AttributeError: 'NoneType' object has no attribute 'id'
# Debug trace — tells you what happened along the entire path from the entry point:
QuerySet.create(kwargs={"name": "test"})
|- Model.__init__(kwargs={"name": "test"})     # Args look fine
|- Model.save(force_insert=True)
   |- Model.save_base()
      |- QuerySet._insert()
         |- SQLInsertCompiler.execute_sql()    # Crashes here, but why?
            return value was None              # Something upstream is wrong
With a debug trace, you don't just see where things crashed. You see at what point the data started going wrong.
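To make the idea of a debug trace concrete, here is a tiny call/return recorder built on Python's standard `sys.settrace` hook. This is a conceptual sketch only, not the modified OpenTelemetry probe described above; the function names are invented for illustration.

```python
# Conceptual sketch: record every call (with its arguments) and every
# return (with its value) during a run, using the standard sys.settrace hook.
import sys

trace_events = []  # entries: (event, function name, args dict or return value)

def tracer(frame, event, arg):
    if event == "call":
        # At call time, f_locals holds the arguments the function received
        trace_events.append(("call", frame.f_code.co_name, dict(frame.f_locals)))
    elif event == "return":
        trace_events.append(("return", frame.f_code.co_name, arg))
    return tracer  # keep tracing nested frames

def inner(x):
    return x * 2

def outer(x):
    return inner(x) + 1

sys.settrace(tracer)
outer(3)
sys.settrace(None)

# trace_events now holds the full chain: outer called with x=3,
# inner called with x=3, inner returned 6, outer returned 7
```

A real probe records far more (timestamps, object state, exception details), but the shape of the data is the same: the whole call chain with arguments and return values, not just the frames an exception happened to pass through.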
Our entire methodology is built on these runtime facts rather than reading code and guessing. The principle is simple: get the facts straight before touching the code.
Why This Matters: The Most Common Failure Mode Is Going in the Wrong Direction
The typical AI bug-fixing flow looks like this: write a reproduction script reproduce_issue.py, run it, watch it fail, then start modifying the codebase to fix whatever the script complains about.
This seems reasonable on the surface. But it routinely leads to three categories of failure:
Failure 1: The script itself has a bug. The script fails, but the failure is caused by a mistake in the script, not by the bug described in the issue. The agent now has a broken signal and starts making code changes in the completely wrong direction.
Failure 2: The reproduction takes the wrong path. The script calls a deep internal function directly, bypassing the path a real user would take. This does trigger an error, but the call chain is incomplete — the agent only sees a fragment of the problem.
Failure 3: The agent patches symptoms instead of causes. The agent sees which line throws the error and adds a defensive check right there — a type guard, a null check, a conditional return — and calls it done. The test passes. But the real problem — an upstream function passing bad data — was never fixed. A different trigger path will surface the same bug again.
The third failure is the most insidious. If execute_sql() receives None, the agent might add if value is None: return and move on. But the root cause is that something higher up in the call chain returned a bad value. That patch just hides the symptom.
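The contrast can be shown with a toy example. All of the functions below are hypothetical, not from any real codebase; the point is only where each patch lands.

```python
# Toy example (hypothetical functions): the crash site is not the root cause.

def load_user(user_id, users):
    # Root cause: silently returns None for an unknown id,
    # sending bad data downstream
    return users.get(user_id)

def format_greeting(user):
    # Crash site: raises TypeError when user is None
    return "Hello, " + user["name"] + "!"

# Symptom patch: a defensive check at the crash site makes the test pass,
# but the bad state is hidden, not fixed
def format_greeting_patched(user):
    if user is None:
        return ""
    return "Hello, " + user["name"] + "!"

# Root-cause fix: fail loudly where the bad value originates
def load_user_fixed(user_id, users):
    if user_id not in users:
        raise KeyError("unknown user id: %r" % user_id)
    return users[user_id]

users = {1: {"name": "Ada"}}
```

Both patches make the reproduction stop crashing. Only the second one stops every other caller of `load_user` from receiving the same bad value.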
This is why we redefined the goal of the Reproduce stage: prove that this failure aligns with what the issue describes — not just that something, somewhere, is failing.
How We Make Reproduction More Reliable
Step 1: Capture a Full Debug Trace, Then Let the Agent Query It Selectively
After running reproduce_issue.py, the system captures a complete debug trace. But we don't dump the entire trace into the agent's context window.
The reason is practical: a single run can produce hundreds of lines of trace output, most of which is irrelevant noise. Hand all of that to an agent and it tends to get lost in the weeds, which hurts accuracy.
Instead, the system returns a short summary of the run, and the agent uses a tool called trace_query.py to pull specific data on demand. The agent investigates like a detective — forming a concrete question, then going to the evidence:
- overview — Get the big picture: were there exceptions? What were the key calls?
- args <function> — What exact arguments did this function receive?
- callers <function> — What called this function?
- chain <function> — What's the full call chain for this function?
- downstream <function> — Who consumed the return value of this function?
These map directly to the natural debugging process: start broad, look at arguments, trace upstream and downstream. The key benefit is that every piece of information the agent sees is something it specifically asked for, directly relevant to the problem at hand — not something it happened to stumble across in a pile of raw logs.
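As a rough illustration, here is a minimal sketch of the kind of queryable trace store that could sit behind commands like `overview`, `args`, and `callers`. The command names come from the list above; the event format and the `TraceStore` class are assumptions for illustration, not the tool's actual implementation.

```python
# Sketch of a queryable trace store (data model assumed for illustration).

class TraceStore:
    def __init__(self, events):
        # events: list of dicts such as
        # {"fn": "Model.save", "caller": "QuerySet.create",
        #  "args": {"force_insert": True}, "ret": None}
        self.events = events

    def overview(self):
        # Big picture: which functions ran, in order
        return [e["fn"] for e in self.events]

    def args(self, fn):
        # Exact arguments each call to fn received
        return [e["args"] for e in self.events if e["fn"] == fn]

    def callers(self, fn):
        # What called fn
        return sorted({e["caller"] for e in self.events
                       if e["fn"] == fn and e["caller"]})

    def chain(self, fn):
        # Walk caller links upward from fn (first call wins in this sketch)
        by_fn = {}
        for e in self.events:
            by_fn.setdefault(e["fn"], e)
        path, cur = [], fn
        while cur in by_fn:
            path.append(cur)
            cur = by_fn[cur]["caller"]
        return list(reversed(path))

events = [
    {"fn": "QuerySet.create", "caller": None,
     "args": {"name": "test"}, "ret": None},
    {"fn": "Model.save", "caller": "QuerySet.create",
     "args": {"force_insert": True}, "ret": None},
]
store = TraceStore(events)
```

The design choice worth noting is that every query returns a small, targeted slice of the trace, which is what keeps the agent's context window focused on the question it actually asked.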
Step 2: Reproduce via the Real User Path First, Not Internal Functions Directly
We explicitly require reproductions to follow the actual user-facing path described in the issue. For example:
- "Run the
migratecommand" - "Call this public API method"
- "Make a request to this URL"
This matters more than it might seem. How you trigger a bug determines where the agent focuses its attention.
If the script calls a deep internal function directly, the agent naturally assumes "the problem is in this function" and patches it defensively. But that function might be perfectly fine — it just received bad input from somewhere above it. Routing the reproduction through the real entry point exposes the complete call chain, which is what makes accurate root cause analysis possible.
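A toy example (all functions hypothetical) makes the difference visible. The same crash is triggered two ways, and only the entry-point path puts the upstream frames in view:

```python
# Toy illustration: how you trigger a bug determines what the trace shows.
import sys
import traceback

def lookup_id(name):
    # Upstream bug: returns None for unknown names instead of raising
    return {"known": 42}.get(name)

def execute_sql(record_id):
    # Crash site: TypeError when record_id is None
    return record_id + 1

def create(name):
    # User-facing entry point, as a (hypothetical) issue would describe it
    return execute_sql(lookup_id(name))

def crash_frames(fn, *args):
    """Run fn and return the function names in the resulting traceback."""
    try:
        fn(*args)
    except TypeError:
        return [f.name for f in traceback.extract_tb(sys.exc_info()[2])]

# Direct internal call: no upstream context at all
print(crash_frames(execute_sql, None))   # ['crash_frames', 'execute_sql']
# Through the entry point: the chain above the crash site is visible
print(crash_frames(create, "new"))       # ['crash_frames', 'create', 'execute_sql']
```

Even this simple stack trace gains a frame when the bug is triggered through the entry point; a full debug trace of the same run would also show `lookup_id` returning `None`, pointing straight at the root cause.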
Step 3: Validate the Reproduction Before Touching Any Code
Once the reproduction script runs, the system automatically checks whether it's trustworthy. This validation is essentially a fact-alignment check:
- Does the failure match what the issue describes? Is it an exception (crash) or unexpected behavior?
- Does the debug trace contain a meaningful number of internal project calls, or is the script just failing inside itself without ever reaching project code?
- Does the trace cover the functions and code paths mentioned in the issue?
- Did the execution go through the expected entry point?
- Do the exception type and error message match what the issue describes?
If these conditions aren't met, the system blocks the agent from making code changes and gives specific feedback. For example: "The current trace shows the exception is thrown inside test_helper.py and never enters project code. The reproduction path is wrong — fix the script's trigger before proceeding."
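A minimal sketch of such a fact-alignment gate might look like the following. The field names, the threshold, and the feedback strings are assumptions made for illustration; the actual system's checks are richer than this.

```python
# Sketch of a fact-alignment gate run before any code changes are allowed.
# Field names and the call-count threshold are assumed for illustration.

def validate_reproduction(trace, issue):
    """Return (ok, problems); an empty problems list means the gate passes."""
    problems = []

    # Check 1: the trace must actually enter project code
    project_calls = [f for f in trace["functions"]
                     if f.startswith(issue["package"])]
    if len(project_calls) < 3:  # assumed threshold
        problems.append("Trace barely enters project code; "
                        "fix the script's trigger before proceeding.")

    # Check 2: execution must pass through the expected entry point
    if issue["entry_point"] not in trace["functions"]:
        problems.append("Expected entry point %s is not in the trace."
                        % issue["entry_point"])

    # Check 3: the exception type must match the issue description
    if issue.get("exception") and issue["exception"] != trace.get("exception"):
        problems.append("Exception mismatch: issue says %s, trace shows %s."
                        % (issue["exception"], trace.get("exception")))

    return (not problems, problems)

trace = {
    "functions": [
        "django.db.models.query.QuerySet.create",
        "django.db.models.base.Model.save",
        "django.db.models.sql.compiler.SQLInsertCompiler.execute_sql",
    ],
    "exception": "AttributeError",
}
issue = {
    "package": "django.db",
    "entry_point": "django.db.models.query.QuerySet.create",
    "exception": "AttributeError",
}
ok, problems = validate_reproduction(trace, issue)  # passes: ok is True
```

When the gate fails, the `problems` list doubles as the specific feedback handed back to the agent, which is what turns a rejected reproduction into an actionable next step rather than a dead end.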
This turns "start hacking on code and see what sticks" into "get your inputs right first."
Step 4: Code Changes Only Happen After Reproduction Is Validated
Only once the reproduction passes validation does the system allow the agent to start modifying code. That means any patch generated from this point is grounded in verified runtime facts — not in noise, a broken script, or an incorrect trigger path.
What This Stage Actually Solves
The Reproduce stage is fundamentally about ensuring the quality of the debug input:
- It turns the debug trace from passive noise into an active, queryable source of facts.
- It keeps the trigger path close to real user behavior instead of bypassing it.
- It uses a validation gate to catch bad reproductions before the agent burns time going in the wrong direction.
This is why the stage can feel slow — but it's why overall pass rate goes up. Getting the facts right before writing the patch is almost always faster in total than writing a patch, watching it fail, and debugging your debugging.
Next up: Stage 2, Generating the Fix. Once the agent has verified runtime facts, how does it use them to identify the root cause and generate a patch? And why are "having a debug trace" and "knowing how to use a debug trace" two very different things?