Achieving an 83.4% Fix Rate on SWE-bench Verified with Runtime Facts

In our latest SWE-bench Verified tests, we validated a new AI debugging paradigm: systematic debugging based on Runtime Facts. By introducing a dynamic tracing mechanism into the Live-SWE-agent architecture to provide the model with runtime context, we achieved a theoretical combined fix rate of 83.4% using the Google Gemini 3 Pro model, which is, to our knowledge, the highest score reported on SWE-bench Verified to date.
Compared to the same model's 77.4% baseline on the original Live-SWE-agent, leveraging Runtime Facts as a decision-making basis allowed us to fix complex bugs the baseline could not solve. We are gradually encapsulating this methodology into the Syncause Debugging Agent Skill, which is now open source on GitHub. If you are building your own AI coding agent or want to walk through this debugging workflow by hand, we welcome you to try the repository.
This post details our testing methodology, data specifications, and how "Runtime Facts" address the biggest pain point for LLMs in code repair: root cause localization.
Experiment Results
To quantify the capability of Runtime Facts in fixing complex bugs, we adopted an Incremental Testing strategy.
- Baseline: Live-SWE-agent + Gemini 3 Pro Preview. On the SWE-bench Verified dataset, this combination had a baseline pass rate of 77.4%, leaving 113 failed cases.
- Syncause Testing: We focused exclusively on these 113 cases that failed the baseline, applying the Syncause debugging methodology for targeted repairs.
- Result: Out of these 113 "hard cases," our agent successfully fixed 30.
- Combined Score: When combined with the cases already passed by the baseline, the theoretical combined fix rate reached 83.4%, a 6-percentage-point gain over the baseline (see the calculation sketch below).
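For transparency, here is the arithmetic behind the combined figure, assuming the standard 500-task SWE-bench Verified split; the variable names are ours for illustration:

```python
# Combined fix rate: baseline passes plus Syncause's incremental fixes,
# over the full SWE-bench Verified task set.
total_tasks = 500                             # size of SWE-bench Verified
baseline_passed = round(0.774 * total_tasks)  # 387 tasks solved by the baseline
hard_cases = total_tasks - baseline_passed    # 113 tasks the baseline failed
syncause_fixed = 30                           # hard cases additionally fixed with Runtime Facts

combined_rate = (baseline_passed + syncause_fixed) / total_tasks
print(f"{combined_rate:.1%}")                 # -> 83.4%
```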
A Note on Methodology:
The current data is calculated based on "Baseline Pass + Syncause Incremental Fix." We have not yet performed full regression testing on the cases originally passed by the baseline. While the code adjustments primarily enhance the debugging process and theoretically should not disrupt existing capabilities, full regression testing is ongoing to adhere to strict software engineering standards.
We are publishing these results now because this significant improvement strongly demonstrates that for deep logical errors where LLMs fail with static analysis alone, runtime data is an effective solution.
- Trajectory Records: https://github.com/Syncause/syncause-swebench
The Core Problem: Don't Guess. Observe.
The way current mainstream AI programming agents (including the original mini-SWE-agent) handle issues is essentially static guessing.
The agent reads the issue description, retrieves source code, and then relies on the LLM's "intuition" to infer the location of the bug. This is akin to a doctor prescribing medication based solely on a patient's history, without using a stethoscope or looking at X-rays. While effective for simple syntax errors or shallow logic, the accuracy drops sharply for complex bugs involving multi-layer calls and state dependencies.
Syncause's core philosophy is: Let the agent run the code, observe what the program actually does, and then make a decision.
The Bottleneck: Root Cause Localization
In analyzing the failed SWE-bench cases, we found that an LLM is far more likely to fix the wrong location than to find the right location and fix it incorrectly.
This is understandable. A typical Django issue might involve dozens of files and hundreds of functions. The issue description is often phenomenological (e.g., "Calling X returns an error"), but the root cause may be hidden in function Z, called by Y, which was called by X, with 5-6 layers of calls in between. Relying on an LLM to infer this call chain purely by reading code is both slow and unreliable.
This is where Runtime Facts come into play.
Runtime Facts Driven Debugging Methodology
To solve the low accuracy caused by "static guessing," we introduced a new workflow based on Runtime Facts.
What are Runtime Facts?
In traditional LLM programming, the model only sees static code text. Runtime Facts refer to structured dynamic data generated during the actual execution of the program.
Instead of asking the LLM to simulate code execution in its "mind," we execute the code and "record" the process. We inject a lightweight Python Tracer during runtime. When a reproduction script or unit test runs, the Tracer automatically captures the following key information:
- Complete Call Stack: Precisely records the hierarchical relationship (Function A calls B, B calls C).
- Context Data: Specific values of arguments passed when each function is called, and the return values after execution.
- Exception Propagation: Where an error is thrown, and where it is caught or ignored.
These data points are no longer vague guesses, but absolute facts of program behavior. Here is a specific example:
Runtime trace:

    testcase:
    QuerySet.create(kwargs={"name": "test"})
      |- ModelBase.__call__(args=...) at db/models/base.py:468
         |- Model.__init__(kwargs={"name": "test"}) at db/models/base.py:501
      |- Model.save(force_insert=True) at db/models/base.py:812
         |- Model.save_base() at db/models/base.py:862
            |- Manager._insert(values=...) at db/models/manager.py:85
               |- QuerySet._insert() at db/models/query.py:1802
                  |- SQLInsertCompiler.execute_sql() at db/models/sql/compiler.py:1553
                     |- return [{"id": 1}]
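To make this concrete, here is a minimal sketch of how such a trace can be captured with Python's built-in `sys.settrace` hook. This is an illustration of the idea, not the actual Syncause tracer; the function names and the `(depth, kind, name, location, payload)` event format are our own, and the later sketches in this post reuse the same format.

```python
import sys

events = []  # collected Runtime Facts: (depth, kind, name, location, payload)
_depth = 0

def tracer(frame, event, arg):
    """Record call, return, and exception events with their runtime context."""
    global _depth
    code = frame.f_code
    name = getattr(code, "co_qualname", code.co_name)  # qualified name on Python 3.11+
    loc = f"{code.co_filename}:{frame.f_lineno}"
    if event == "call":
        # Arguments the function was actually called with.
        args = {k: frame.f_locals.get(k) for k in code.co_varnames[:code.co_argcount]}
        events.append((_depth, "call", name, loc, args))
        _depth += 1
    elif event == "return":
        _depth -= 1
        events.append((_depth, "return", name, loc, arg))
    elif event == "exception":
        exc_type, exc_value, _tb = arg
        events.append((_depth, "exception", name, loc, f"{exc_type.__name__}: {exc_value}"))
    return tracer  # keep tracing frames opened inside this one

def run_traced(fn, *args, **kwargs):
    """Run a reproduction script or unit test under the tracer."""
    sys.settrace(tracer)
    try:
        return fn(*args, **kwargs)
    finally:
        sys.settrace(None)
```

In practice, a tracer like this would also filter out frames from the standard library and third-party packages so the trace stays focused on the project under repair.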
Agent Architecture: The Three-Role Pipeline
We decomposed the task of "fixing bugs" into three roles, resembling a small software team:
- Analyst (Reproducer): Responsible for writing and validating test scripts that can trigger the bug.
- Developer: Locates the root cause and modifies the code based on the runtime evidence provided by the Analyst.
- Verifier: Runs tests to confirm the fix is effective and has no side effects.
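As a rough sketch of how these roles can be wired together, the loop below shows the control flow, assuming each role is an LLM-backed component; all class, function, and parameter names are illustrative, not the actual Syncause interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Attempt:
    repro_script: str                           # test script that triggers the bug
    trace: list = field(default_factory=list)   # Runtime Facts captured while running it
    patch: str | None = None

# Placeholders for the three LLM-backed roles.
def analyst_reproduce(issue: str) -> Attempt:
    raise NotImplementedError("write and validate a reproduction script, capture its trace")

def developer_patch(issue: str, trace: list, feedback: str | None) -> str:
    raise NotImplementedError("localize the root cause from the trace and produce a patch")

def verifier_check(attempt: Attempt) -> tuple[bool, str | None]:
    raise NotImplementedError("rerun tests, diff pre/post traces, return (passed, feedback)")

def fix_issue(issue: str, max_rounds: int = 3) -> str | None:
    """Analyst -> Developer -> Verifier loop driven by Runtime Facts."""
    attempt = analyst_reproduce(issue)
    feedback = None
    for _ in range(max_rounds):
        attempt.patch = developer_patch(issue, attempt.trace, feedback)
        passed, feedback = verifier_check(attempt)
        if passed:
            return attempt.patch
    return None
```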
The Role of Runtime Facts in Each Stage
Runtime Facts are not limited to a single step; they permeate the entire repair lifecycle, solving specific challenges at different stages:
Stage 1: Analyst — Eliminating False Positives
LLMs often write "false positive" test scripts: the script fails, but the error has nothing to do with the issue (e.g., import errors or syntax errors). In this stage, the primary role of Runtime Facts is intent verification.
After running the reproduction script, the Analyst checks the generated trace. If the issue describes an error during "model saving," but the trace shows the code never executed the save() method, or the error message does not match the description, the system determines the reproduction failed.
This ensures the Developer receives not just a "script that errors out," but a script that accurately triggers the target logic path.
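A simplified illustration of this intent check is shown below; the heuristics (a keyword hit on the executed call path plus error-message matching) are our own approximation of the idea, not the exact Syncause rules.

```python
def reproduction_is_valid(trace_events, issue_keywords, observed_error, issue_error):
    """Reject 'false positive' reproductions that fail for unrelated reasons."""
    called = {name for _depth, kind, name, _loc, _payload in trace_events if kind == "call"}
    # 1. The code path the issue talks about must actually have executed
    #    (e.g. a "model saving" issue must show save() somewhere in the trace).
    hit_target_path = any(kw in name for kw in issue_keywords for name in called)
    # 2. The observed error must resemble the error described in the issue,
    #    not an unrelated ImportError or SyntaxError from a broken script.
    error_matches = issue_error.lower() in observed_error.lower()
    return hit_target_path and error_matches
```

If either check fails, the Analyst revises the reproduction script before anything is handed to the Developer.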
Stage 2: Developer — Root Cause Localization
This is where Runtime Facts offer the most value. In complex Django or Flask projects, the entry function mentioned in the issue description is often 5-6 layers away from the actual bug. In this stage, the primary role of Runtime Facts is search space convergence.
The system matches key function names from the issue with the Runtime Trace (marked as [ISSUE_MATCH]). The Developer does not need to read dozens of files, but simply reads the Trace:
The user-called `translate_url()` (entry point) actually went through the path `reverse()` -> `resolve()` -> `normalize()` (bug point), and `normalize` received `None` as an argument.
This directly focuses the LLM's attention from "the entire codebase" to "this specific execution chain," drastically improving localization accuracy. For example, in the Django case above:
- Without Runtime Facts: The LLM sees the issue mention `Model.create` and starts blindly guessing inside `models.py`.
- With Runtime Facts: The LLM sees the trace showing `Model.create` eventually called `SQLCompiler.execute_sql` and returned `[{"id": 1}]`, allowing it to pinpoint the issue immediately to the SQL generation phase.
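A minimal sketch of the matching step is shown below, reusing the event format from the tracer sketch above; the keyword extraction is deliberately simplified and the function name is ours.

```python
import re

def annotate_trace(trace_events, issue_text):
    """Render the trace, tagging frames whose function names appear in the issue."""
    # Pull call-like tokens (e.g. "translate_url()") out of the issue text.
    keywords = {m.rstrip("()") for m in re.findall(r"[A-Za-z_]\w*\(\)", issue_text)}
    lines = []
    for depth, kind, name, loc, payload in trace_events:
        if kind == "call":
            tag = " [ISSUE_MATCH]" if any(kw in name for kw in keywords) else ""
            lines.append(f"{'  ' * depth}|- {name}({payload}) at {loc}{tag}")
        else:
            lines.append(f"{'  ' * depth}|- {kind} {payload!r}")
    return "\n".join(lines)
```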
Stage 3: Verifier — Side-effect Detection
The biggest risk in bug fixing is introducing regression. In this stage, the primary role of Runtime Facts is Diff Analysis.
When the Developer submits a fix, the Verifier looks not only at whether the test passed but also compares the Runtime Trace before and after the fix.
- Structured Diff: "After modification, the call to Function A disappeared, and a call to Function B was added."
- Failure Feedback: If the fix fails, the system feeds this "behavioral change" back to the Developer: "Your modification caused an early return, failing to execute the critical logic." This ensures the next attempt is not random trial-and-error, but iterative refinement based on previous failure experience.
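Below is a simplified version of this trace comparison; real traces need smarter alignment than a set difference of call names, so treat this as a sketch of the idea rather than the actual Verifier.

```python
def diff_traces(before, after):
    """Summarize behavioral changes between pre-fix and post-fix traces."""
    calls_before = {name for _d, kind, name, _loc, _p in before if kind == "call"}
    calls_after = {name for _d, kind, name, _loc, _p in after if kind == "call"}
    return {
        "disappeared": sorted(calls_before - calls_after),  # e.g. the buggy branch no longer runs
        "added": sorted(calls_after - calls_before),        # e.g. a new code path is exercised
    }

def build_feedback(diff, tests_passed):
    """Turn the structured diff into feedback for the Developer's next attempt."""
    if tests_passed:
        return None
    return (f"Tests still fail. Calls that disappeared after your patch: {diff['disappeared']}; "
            f"calls that appeared: {diff['added']}. Check whether critical logic is being skipped.")
```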
Summary: From Guessing to Observing
The core philosophy of the entire system can be summarized in one sentence:
Don't guess. Observe.
Program behavior does not need to be guessed. Run it, and see what it does.
We are gradually engineering this research. Currently, this core debugging capability based on Runtime Facts has been encapsulated in the Syncause Debug Agent Skill. If you are tired of AI "guessing" at your code problems and want it to have stronger root cause analysis capabilities, you are welcome to visit our GitHub repository: Syncause Debug Skill
We are continuously optimizing the Agent code to further improve repair accuracy and are gradually migrating more of these results into the Syncause product, dedicated to solving the pain point of AI being unable to fix root causes.