"Our agent passes 63% of tasks." Okay — why does it fail the other 37%?
SWE-bench-style agent benchmarks produce a single number: tests-pass rate. That's the right top-line metric, and it's almost entirely useless for making the agent better. The thing you actually need, when you're staring at the 37% that failed, is a categorization of how they failed. An agent that fails by hitting its iteration limit is a different problem from one that fails by making the wrong edit is a different problem from one that confidently declares success without running the tests.
When I built swe-agent-lite, I wanted the framework to produce that categorization automatically. This post is the taxonomy, why each tag is cheap to detect from the trajectory alone, and what the distribution of tags actually revealed when I started running Claude against the task set.
## The setup
swe-agent-lite is deliberately small. 13 curated bug-fix tasks (3 multi-file, 1 hard), 5 tools (read_file, list_dir, edit_file, run_tests, finish), a sandboxed workspace per task. Each task has a canonical fix a human could apply in under a minute. The benchmark is measuring agent reliability, not agent reasoning about complex code — two different capabilities, each worth measuring separately.
Every task run produces a trajectory: an ordered list of tool calls the agent made, whether it called finish, and whether pytest passes at the end. The failure-mode tagger is a pure function over that trajectory plus the final score.
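Concretely, the trajectory record can be pictured as a couple of dataclasses. This is an illustrative sketch — ToolCall and Trajectory are hypothetical names, not the actual swe-agent-lite types:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str          # one of: read_file, list_dir, edit_file, run_tests, finish
    output: str = ""   # raw tool output, e.g. "TIMEOUT" for a hung test run

@dataclass
class Trajectory:
    calls: list[ToolCall] = field(default_factory=list)
    called_finish: bool = False   # did the agent declare itself done?
    tests_pass: bool = False      # the scorer's final pytest verdict
```

Everything the tagger needs is in that record; no workspace access, no model calls.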
## The eight tags
| Tag | Trigger | What it usually means |
|---|---|---|
| hit_iteration_limit | Agent never called finish and we bailed | Stuck in a loop, or incrementally failing without recognizing it |
| never_ran_tests | No run_tests call in the trajectory | Almost always bad — agent is flying blind |
| didnt_edit | No edit_file call before finish | Agent gave up without attempting a fix |
| edit_but_no_retest | Edits weren't followed by a test run before finish | Agent assumes its fix works without verifying |
| tests_were_timing_out | Multiple TIMEOUT outputs from the agent's own test runs | Agent wrote an infinite loop and didn't diagnose it |
| agent_timeout | The scorer's final pytest call itself timed out | Worse version of the above — agent left the workspace in a hanging state |
| exit_early | Agent called finish but tests still fail | Over-confidence — agent thinks it solved it, didn't |
| framework_error:* | Our own code blew up | Bug in the benchmark runner itself |
These are heuristic, not causal. exit_early plus edit_but_no_retest is a strong signal the agent stopped checking its work. hit_iteration_limit plus never_ran_tests probably means the agent was reading files forever and never tried anything. The combinations matter as much as the individual tags.
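As a rough sketch of what those predicates look like in code — hypothetical names and shapes, not the real implementation; agent_timeout and framework_error:* depend on the scorer's own state, so they're omitted:

```python
def tag_failure(calls, called_finish, tests_pass):
    """calls: the ordered trajectory as (tool_name, output) pairs."""
    if tests_pass:
        return []                        # only failed runs get tags
    tags = []
    tools = [tool for tool, _ in calls]
    if not called_finish:
        tags.append("hit_iteration_limit")
    if "run_tests" not in tools:
        tags.append("never_ran_tests")
    if called_finish and "edit_file" not in tools:
        tags.append("didnt_edit")
    if "edit_file" in tools:
        last_edit = max(i for i, t in enumerate(tools) if t == "edit_file")
        if "run_tests" not in tools[last_edit + 1:]:
            tags.append("edit_but_no_retest")
    timeouts = sum(1 for tool, out in calls
                   if tool == "run_tests" and "TIMEOUT" in out)
    if timeouts >= 2:
        tags.append("tests_were_timing_out")
    if called_finish:
        tags.append("exit_early")        # finished, but tests still fail
    return tags
```

Note the combinations fall out for free: an agent that edits, never tests, and calls finish collects never_ran_tests, edit_but_no_retest, and exit_early in a single pass.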
## Why trajectory-based tags are the right scope
The obvious alternative is LLM-based failure classification: show a judge model the full trajectory and ask "why did this fail?". I considered it, didn't do it. Two reasons:
- Ground-truth latency. Trajectory tags compute in ~10ms regardless of whether the task passed or failed. LLM classification adds ~5 seconds per task and costs money. When you're running the benchmark in CI on every prompt change, deterministic tagging beats judge-in-the-loop every time.
- Traceability. Each tag maps to a specific predicate over the trajectory. If a reviewer disagrees with a tag, they can read the code in 60 seconds and decide whether the predicate is well-formed. A judge model's verdict is a black box that disagrees with you in ways that are hard to audit.
The tradeoff is surface area. Trajectory tags only catch failure shapes I thought to enumerate. A novel failure mode — say, an agent that correctly edits the file but deletes the test file too — wouldn't match any of my eight tags. That's a real gap. The mitigation is that I add a new tag the first time I encounter a new shape, and the test set grows with me.
## What running it actually teaches you
Here's a rough distribution from my own runs. Exact numbers depend on which model and system prompt:
| Tag | Frequency among failures | Fix difficulty |
|---|---|---|
| edit_but_no_retest | most common | Cheap — add "always run tests after editing" to the system prompt |
| exit_early | common | Cheap — add "only call finish after seeing green tests" to the system prompt |
| hit_iteration_limit | middle | Medium — sometimes the cap is too tight, sometimes the agent is legitimately stuck |
| didnt_edit | rare | Usually signals a task the agent fundamentally doesn't understand |
| never_ran_tests | rare | Easy to prevent with prompt; hard to prevent with certainty |
| tests_were_timing_out | rare | Hard — agent wrote infinite-loop code and failed to diagnose |
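Producing a distribution like the one above from per-run tag lists is a few lines; a sketch, assuming each failed run yields a list of tags:

```python
from collections import Counter

def failure_report(runs):
    """runs: one tag list per failed task run (hypothetical shape)."""
    counts = Counter(tag for tags in runs for tag in tags)
    total = sum(counts.values())
    for tag, n in counts.most_common():
        print(f"{tag:24s} {n:3d}  ({n / total:.0%} of tagged failures)")
    return counts
```

Because the tagger is deterministic, two reports from different system prompts are directly comparable, tag by tag.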
What this changed about how I think about coding agents:
Most "agent fails" in benchmark runs aren't reasoning failures. They're procedural failures.
The agent understood the bug, made the right edit, and then either (a) didn't run the tests before finishing or (b) called finish before checking whether the tests actually passed. Those are cheap prompt fixes. Before I had failure tags, I would have looked at a 63% pass rate and assumed the 37% was "tasks the model can't solve." Post-tagging, most of it is "tasks the model can solve if you remind it to check its work."
That's a meaningfully different picture. It means the highest-leverage tuning isn't "make the model smarter" — it's "make the system-prompt scaffold more rigorous about the edit-verify-finish loop."
## What this doesn't catch
- Subtle wrong edits. An agent that edits the wrong file but runs the tests and fails cleanly will show up as a plain failure with no tag. It looks the same as "unsolvable task" in the report. You need to read the trajectory to distinguish. I could add a tag for "edit happened to a file that isn't the one the task is about" but that requires knowing the canonical fix in advance, which is leaky.
- Overfitting to the tag set. Once you have a categorization, there's a temptation to optimize only for tag counts going down. That's fine if the tags capture everything that matters; it's dangerous when a new failure mode appears and your dashboard still looks green.
- Generalization. These tags were designed around my 5-tool agent surface. An agent with bash access has different failure modes (shell escapes, rm -rf disasters) that need their own tags.
## Pattern: what I'd take to any agent benchmark
Three rules from building this:
- Tag the trajectory, not just the outcome. A boolean "did tests pass" is nowhere near enough. Minimum viable signal is: did the agent finish cleanly, did it verify, and did any tool call produce an error.
- Pick tag names that are accusations, not descriptions. edit_but_no_retest says what went wrong in a way a product-minded reader can act on. "Insufficient test verification coverage" says the same thing in bureaucratic language nobody can debug from.
- Ship the tagger as a pure function. No LLM in the categorization loop. Determinism is what makes failure modes comparable across runs; an LLM judge that categorizes inconsistently breaks the whole premise.
## What this is for, honestly
swe-agent-lite isn't a competitor to SWE-bench. It's a smoke test I can run in two minutes after tweaking a system prompt, with enough signal to tell me which direction I moved the agent. That's a useful object, and as far as I can tell, not one that's well-represented in the public research tooling. Most benchmarks are either tiny and useless or enormous and slow. The middle — "small and legible" — is where you actually do iteration.
The failure-mode tags are what make it legible. Without them it's a pass rate. With them, it's a diagnostic report you can act on.