
How We Hit 83.4% on SWE-bench Verified (Part 3): Proving the Fix Actually Works

Syncause

We recently tested an AI debugging methodology on SWE-bench Verified and achieved a combined pass rate of 83.4%. Our overview post covers the full methodology, results, and high-level thinking — if you haven't read it yet, that's a good place to start.

The methodology breaks down into three stages: reproduce the bug → generate a fix → verify the fix is trustworthy. This series walks through each stage and explains how runtime facts guide the AI toward the right answer at every step.

Part 1 covered the Reproduce stage: run the program first, collect real call chains and argument data as runtime facts, and use that evidence instead of guesswork.

Part 2 covered the Generate Fix stage: use those runtime facts to constrain the agent's behavior and prevent it from patching symptoms at the crash site instead of fixing the actual root cause.

This post answers the final question: once a patch is generated, how do you prove it actually fixed the bug — rather than just making the tests pass?


Here's the reality: a lot of patches fail because the validation bar is too low.

  • The reproduction script passes, but it never actually triggered the bug — it just bypassed the real code path. This is a bypass fix.
  • The bug case stops throwing an error, but normal behavior gets broken in the process. This is an over-fix regression.
  • The agent submits too early when things are "good enough," or keeps tinkering after the fix is already solid and accidentally breaks a working state.

So this stage has to answer three questions: Did this patch actually fix the issue? Did it break anything that was working before? Is it safe to commit right now?

A Quick Recap: What Are Runtime Facts?

If you haven't read Parts 1 and 2, here's the short version.

Runtime facts are all the observable data produced while a program is running: debug traces, logs, object state snapshots, and exception information. A debug trace records an entire execution run — which functions were called and in what order, what arguments each function received, what each function returned, and where exceptions were thrown or caught. We collect this automatically using a modified OpenTelemetry probe.
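To make that concrete, here is a minimal sketch of what one debug trace record could look like. The `TraceRecord` shape and its field names are our illustration for this post, not the actual schema the OpenTelemetry probe emits:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class TraceRecord:
    """One entry in a debug trace (field names are illustrative)."""
    function: str                       # fully qualified function name
    args: dict[str, Any]                # arguments the function received
    return_value: Any = None            # what the function returned
    exception: Optional[str] = None     # exception raised here, if any
    children: list["TraceRecord"] = field(default_factory=list)  # calls it made, in order

# A tiny trace: create() called save(), which raised a TypeError.
trace = TraceRecord(
    function="models.User.create",
    args={"name": "test"},
    exception="TypeError",
    children=[TraceRecord(function="models.User.save", args={}, exception="TypeError")],
)

def functions_called(rec: TraceRecord) -> list[str]:
    """Flatten a trace into call order, depth-first."""
    out = [rec.function]
    for child in rec.children:
        out.extend(functions_called(child))
    return out
```

With records shaped like this, "which functions were called and in what order" becomes a simple traversal rather than a guess.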

This is the foundation the whole system is built on. With runtime facts, every judgment points to specific data — not just a test's final exit code.

Why We Validate Against Runtime Facts, Not Just Test Results

Our validation isn't "run some tests and check the exit code." It's checking a runtime evidence chain — a chain of reasoning where every judgment points to specific data in the runtime facts.

The reason is straightforward: an exit code can tell you "this run passed." It can't tell you "did the fix go through the right code path," "did it quietly break something else," or "is it safe to commit now."

So in this stage, every judgment comes back to runtime facts: call paths in the trace, the volume of internal project calls, exception state, key object attribute snapshots, and the diff between pre-patch and post-patch baselines.

This turns "verification" from a results check into a causality check.

How It Works: Breaking "Fixed" Into Three Layers of Provable Evidence

Once the agent enters validate_fix, the system works through a chain of checks. Each layer has a concrete pass/fail criterion.

Layer 1: You're Fixing the Source Code, Not Fixing the Scoreboard

If the agent's changes only touch test files — or make tests pass by changing what the test expects — the system rolls back those changes and requires the agent to fix the root cause in the actual source code.

This catches the most basic form of false positive. Imagine the issue says "create() should return the object's ID, but it returns None instead." If the agent changes the test to expect None, the test goes green. The bug is completely untouched.
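As an illustration, the test-only-patch check can be approximated with a few path heuristics over the files the patch touches. The heuristics below are assumptions for this sketch, not the system's actual rules:

```python
def touches_only_tests(changed_files: list[str]) -> bool:
    """Layer 1 check (sketch): flag patches that modify nothing but test files.

    The path patterns here are illustrative heuristics, not the
    system's real classification rules.
    """
    def is_test_file(path: str) -> bool:
        name = path.rsplit("/", 1)[-1]
        return (
            "/tests/" in path
            or name.startswith("test_")
            or name.endswith("_test.py")
        )
    # An empty patch isn't a "test-only" patch; require at least one file.
    return bool(changed_files) and all(is_test_file(p) for p in changed_files)

# A patch that only edits a test gets rolled back:
touches_only_tests(["tests/test_create.py"])                    # → True: roll back
touches_only_tests(["src/models.py", "tests/test_create.py"])   # → False: allowed
```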

Layer 2: The Bug Is Fixed on the Real Code Path, Not Around It

reproduce_issue.py must go from failing to passing — and the debug trace must show a meaningful volume of internal project calls.

Why check call volume? If the trace shows almost no project code was executed, the script never actually walked the real code path. It ran, produced no error, and exited — but only because the bug was never triggered. The system doesn't accept "the script stopped complaining" as success. It requires "the problem was eliminated on the actual execution path."
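A call-volume check like this can be sketched in a few lines. The `django.` prefix and the threshold of 5 are placeholder values for illustration; the real system derives what counts as "internal" and "meaningful" from the project under test:

```python
def walked_real_code_path(trace_functions: list[str],
                          project_prefix: str = "django.",
                          min_internal_calls: int = 5) -> bool:
    """Layer 2 check (sketch): require a meaningful volume of internal
    project calls in the trace before accepting a passing reproduction."""
    internal = [f for f in trace_functions if f.startswith(project_prefix)]
    return len(internal) >= min_internal_calls

# A script that touched only two project functions before exiting cleanly
# almost certainly bypassed the real code path:
walked_real_code_path(["django.db.save", "django.db.commit"])  # → False: rejected
```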

Layer 3: Nothing That Was Working Before Is Now Broken

This is the happy-path baseline comparison. Before the patch, the system runs happy_path_test.py and records the normal execution state as a baseline. After the patch, it runs again and diffs the two.

The comparison isn't just "did it print success." It looks for unexpected changes in four areas:

  • Did key call anchors disappear in large numbers? (Indicates logic that should have run was skipped)
  • Did the volume of internal project calls drop significantly? (Indicates an execution path was accidentally cut short)
  • Did any new exceptions appear?
  • Did key objects end up in unexpected states after the patch?

That last point deserves more detail. We place a SNAPSHOT_ANCHOR on critical objects, capturing their attribute state before and after the patch. For example:

```text
# Snapshot before patch:
User object after save():
  id: 1
  name: "test"
  is_active: True
  created_at: 2024-01-01

# Snapshot after patch:
User object after save():
  id: 1
  name: "test"
  is_active: False       ← unexpected change
  created_at: None       ← unexpected change
```

The save() bug was fixed — but the patch accidentally broke how is_active and created_at get assigned. An exit code would never catch this. The attribute snapshot diff surfaces it immediately.

This layer exists to catch a specific failure mode: a patch that fixes the bug case but silently changes behavior that was correct before.
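The snapshot comparison itself is simple to sketch: diff the attribute maps captured at the same anchor before and after the patch, and surface every key whose value changed. This is an illustration of the idea, not the system's implementation:

```python
def snapshot_diff(before: dict, after: dict) -> dict:
    """Layer 3 check (sketch): return {attribute: (before, after)} for every
    attribute that changed between two SNAPSHOT_ANCHOR captures."""
    return {
        key: (before.get(key), after.get(key))
        for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    }

# The example from above, as attribute maps:
before = {"id": 1, "name": "test", "is_active": True, "created_at": "2024-01-01"}
after = {"id": 1, "name": "test", "is_active": False, "created_at": None}
snapshot_diff(before, after)
# → {"is_active": (True, False), "created_at": ("2024-01-01", None)}
```

An empty diff means the happy path is behaviorally unchanged; anything else is flagged for review.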

Why We Also Run Edge Case Tests

After the main reproduction passes, the system requires the agent to run reproduce_edgecase.py — a set of edge case tests that were generated during the Reproduce stage.

Edge cases test whether the fix holds up at the boundaries. If the issue describes "fails when given an empty string," the main reproduction only tests that one scenario. The edge cases add adjacent inputs: passing None, passing an extremely long string, and so on. If the main reproduction passes but edge cases fail, the system flags it clearly: this patch probably fixed the specific symptom, not the root cause.
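For that empty-string example, a reproduce_edgecase.py might look roughly like this. Here `normalize` is a hypothetical stand-in for the patched project function, invented for the sketch:

```python
# reproduce_edgecase.py (sketch): the main reproduction covers the empty
# string; these tests probe adjacent boundary inputs around it.

def normalize(value):
    """Hypothetical stand-in for the project function under test."""
    if value is None:
        return ""
    return str(value).strip().lower()

def test_none_input():
    assert normalize(None) == ""       # adjacent to the empty-string case

def test_empty_string():
    assert normalize("") == ""         # the scenario from the issue itself

def test_very_long_string():
    assert normalize("A" * 100_000) == "a" * 100_000  # extreme length

for test in (test_none_input, test_empty_string, test_very_long_string):
    test()
    print(f"{test.__name__}: PASS")
```

If the main reproduction passes but `test_none_input` fails, that asymmetry is exactly the "fixed the symptom, not the root cause" signal described above.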

That said, edge cases can themselves be wrong. For instance, an edge case might call a deep internal function directly, bypassing the real user entry point, and effectively demand that the fix land at a specific internal location. A correct fix could then be falsely rejected.

When this happens, the system allows the agent to reject that edge case (EDGECASE_REJECT) and generate a more appropriate boundary test to replace it. This ensures edge cases remain supplementary evidence — not incorrect constraints that block valid fixes.

One Final Gate Before Committing

Once all the above checks pass, the agent can move toward committing. But before the commit goes through, the system runs a final gate that verifies five things simultaneously:

  • Reproduction quality: Did the reproduction walk the real code path?
  • Baseline comparison: Is happy-path behavior consistent before and after the patch?
  • Focus alignment: Do the changes fall within the area causally connected to the root cause?
  • Edge case status: Did edge cases pass, or were any rejections justified?
  • Hypothesis completeness: Does the root cause analysis include both supporting and refuting evidence?

If any of these fail, the commit is blocked and the agent gets specific instructions on what to address next.
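The gate can be pictured as an all-or-nothing check over those five conditions, returning the list of failures as the agent's next to-do items. The field names below are our shorthand for the checks described above:

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    reproduction_quality: bool   # reproduction walked the real code path
    baseline_consistent: bool    # happy-path behavior unchanged by the patch
    focus_aligned: bool          # changes stay in the root-cause area
    edgecases_ok: bool           # edge cases passed, or rejections justified
    hypothesis_complete: bool    # both supporting and refuting evidence present

def commit_gate(result: GateResult) -> tuple[bool, list[str]]:
    """Final gate (sketch): allow the commit only if every check passed,
    otherwise name exactly what still needs attention."""
    failures = [name for name, ok in vars(result).items() if not ok]
    return (not failures, failures)

ok, todo = commit_gate(GateResult(True, True, True, False, True))
# → ok is False, todo is ["edgecases_ok"]
```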

The reverse also matters: once all conditions are satisfied, the system locks the state and prevents the agent from running any further trace queries, file reads, or code changes. The only permitted action is the commit itself.

Why? An agent in the "good enough to ship" zone is actually at its most dangerous. It keeps browsing files, keeps querying traces, keeps making small tweaks — and ends up breaking a working, validated fix. Locking the commit-ready state is how we protect it.
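The lock itself can be sketched as a tiny state machine: once every gate condition is satisfied, every action except the commit is refused. The action names here are illustrative:

```python
class CommitReadyState:
    """Sketch of the commit-ready lock: after lock(), the only
    permitted action is the commit itself."""

    def __init__(self) -> None:
        self.locked = False

    def lock(self) -> None:
        """Called once all gate conditions pass."""
        self.locked = True

    def allow(self, action: str) -> bool:
        """Refuse everything except 'commit' while locked."""
        if self.locked:
            return action == "commit"
        return True

state = CommitReadyState()
state.lock()
state.allow("read_file")    # → False: no more browsing
state.allow("trace_query")  # → False: no more trace queries
state.allow("commit")       # → True: the only way forward
```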


Three Posts, One Complete Debug Pipeline

Here's what this series covered:

Part 1 was about making reproduction accurate and trustworthy: run the program before touching code, collect reliable runtime facts, replace guesswork with evidence, and ensure the reproduction follows the real user path.

Part 2 was about finding the root cause and generating the right fix: use runtime facts to constrain the agent — the hypothesis card requires both supporting and refuting evidence, 5-Whys forces the trace to go upstream to wherever the bad value was produced, and focus alignment prevents fix drift.

Part 3 was about proving the fix actually works: break "fixed" into three layers of provable evidence, use baseline comparison to detect side effects, and use the commit gate to prevent unsafe submissions.

The logic across all three stages comes down to one sentence: establish facts first, make decisions based on those facts, then verify the decisions against facts.

If you want to see the implementation, the full methodology is available in our GitHub repository: Syncause Debug Skill. If you're building your own AI coding agent or want to walk through this debug process yourself, give it a try.