"Our agent passes 63% of tasks." Okay — why does it fail the other 37%?

SWE-bench-style agent benchmarks produce a single number: tests-pass rate. That's the right top-line metric, and it's almost entirely useless for making the agent better. The thing you actually need, when you're staring at the 37% that failed, is a categorization of how they failed. An agent that fails by hitting its iteration limit is a different problem from one that fails by making the wrong edit is a different problem from one that confidently declares success without running the tests.

When I built swe-agent-lite, I wanted the framework to produce that categorization automatically. This post is the taxonomy, why each tag is cheap to detect from the trajectory alone, and what the distribution of tags actually revealed when I started running Claude against the task set.

The setup

swe-agent-lite is deliberately small. 13 curated bug-fix tasks (3 multi-file, 1 hard), 5 tools (read_file, list_dir, edit_file, run_tests, finish), a sandboxed workspace per task. Each task has a canonical fix a human could apply in under a minute. The benchmark measures agent reliability, not agent reasoning about complex code; those are two different capabilities, each worth measuring separately.

Every task run produces a trajectory: an ordered list of tool calls the agent made, whether it called finish, and whether pytest passes at the end. The failure-mode tagger is a pure function over that trajectory plus the final score.
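In code terms, the tagger's input might look like this. A sketch only: the field and type names are my guesses, not swe-agent-lite's actual types.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str          # "read_file", "list_dir", "edit_file", "run_tests", "finish"
    output: str = ""   # what the tool returned to the agent

@dataclass
class Trajectory:
    calls: list[ToolCall] = field(default_factory=list)
    called_finish: bool = False   # did the agent terminate itself?
    tests_pass: bool = False      # the scorer's final pytest verdict
```

The key property: everything the tagger needs is already recorded per run, so tagging costs nothing relative to running the agent.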

The eight tags

| Tag | Trigger | What it usually means |
| --- | --- | --- |
| hit_iteration_limit | Agent never called finish and we bailed | Stuck in a loop, or incrementally failing without recognizing it |
| never_ran_tests | No run_tests call in the trajectory | Almost always bad — agent is flying blind |
| didnt_edit | No edit_file call before finish | Agent gave up without attempting a fix |
| edit_but_no_retest | Edits weren't followed by a test run before finish | Agent assumes its fix works without verifying |
| tests_were_timing_out | Multiple TIMEOUT outputs from the agent's own test runs | Agent wrote an infinite loop and didn't diagnose it |
| agent_timeout | The scorer's final pytest call itself timed out | Worse version of the above — agent left the workspace in a hanging state |
| exit_early | Agent called finish but tests still fail | Over-confidence — agent thinks it solved it, didn't |
| framework_error:* | Our own code blew up | Bug in the benchmark runner itself |

These are heuristic, not causal. exit_early plus edit_but_no_retest is a strong signal the agent stopped checking its work. hit_iteration_limit plus never_ran_tests probably means the agent was reading files forever and never tried anything. The combinations matter as much as the individual tags.
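To make the "cheap to detect" claim concrete, here is a sketch of what such a tagger could look like. Names and trajectory shape are illustrative, not swe-agent-lite's actual API; agent_timeout and framework_error:* come from the scorer rather than the trajectory, so they're omitted.

```python
def tag_failure(calls, called_finish, tests_pass):
    """Pure function: trajectory + final verdict -> set of failure tags.
    `calls` is a list of {"tool": ..., "output": ...} dicts."""
    if tests_pass:
        return set()
    tools = [c["tool"] for c in calls]
    tags = set()
    if not called_finish:
        tags.add("hit_iteration_limit")
    else:
        tags.add("exit_early")          # finished, but tests still fail
    if "run_tests" not in tools:
        tags.add("never_ran_tests")
    if "edit_file" not in tools:
        tags.add("didnt_edit")
    else:
        # Any edit after the last test run means the fix was never re-verified.
        last_edit = max(i for i, t in enumerate(tools) if t == "edit_file")
        last_test = max((i for i, t in enumerate(tools) if t == "run_tests"),
                        default=-1)
        if last_edit > last_test:
            tags.add("edit_but_no_retest")
    timeouts = sum("TIMEOUT" in c.get("output", "")
                   for c in calls if c["tool"] == "run_tests")
    if timeouts >= 2:
        tags.add("tests_were_timing_out")
    return tags
```

Each predicate is a line or two, which is what makes the tags auditable in the way described below.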

Why trajectory-based tags are the right scope

The obvious alternative is LLM-based failure classification: show a judge model the full trajectory and ask "why did this fail?". I considered it, didn't do it. Two reasons:

  1. Latency and cost. Trajectory tags compute in about 10 ms, pass or fail. LLM classification adds ~5 seconds per task and costs money. When you're running the benchmark in CI on every prompt change, deterministic tagging beats judge-in-the-loop every time.
  2. Traceability. Each tag maps to a specific predicate over the trajectory. If a reviewer disagrees with a tag, they can read the code in 60 seconds and decide whether the predicate is well-formed. A judge model's verdict is a black box that disagrees with you in ways that are hard to audit.

The tradeoff is surface area. Trajectory tags only catch failure shapes I thought to enumerate. A novel failure mode — say, an agent that correctly edits the file but deletes the test file too — wouldn't match any of my eight tags. That's a real gap. The mitigation is that I add a new tag the first time I encounter a new shape, and the test set grows with me.
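Extending the taxonomy is one predicate per new shape. For the deleted-test-file example, a sketch (assuming each edit_file call records its target path; the path heuristic is mine):

```python
def edited_test_files(calls):
    """Candidate ninth tag: flag runs where the agent edited a test
    module, a shape none of the eight tags would catch.  Assumes each
    edit_file call records the target path."""
    return any(
        c["tool"] == "edit_file" and "test" in c.get("path", "")
        for c in calls
    )
```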

What running it actually teaches you

Here's a rough distribution from my own runs. The exact numbers depend on the model and the system prompt:

| Tag | Frequency among failures | Fix difficulty |
| --- | --- | --- |
| edit_but_no_retest | most common | Cheap — add "always run tests after editing" to the system prompt |
| exit_early | common | Cheap — add "only call finish after seeing green tests" to the system prompt |
| hit_iteration_limit | middle | Medium — sometimes the cap is too tight, sometimes the agent is legitimately stuck |
| didnt_edit | rare | Usually signals a task the agent fundamentally doesn't understand |
| never_ran_tests | rare | Easy to prevent with prompt; hard to prevent with certainty |
| tests_were_timing_out | rare | Hard — agent wrote infinite-loop code and failed to diagnose |
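A table like this falls straight out of a counter over tagged runs. A sketch, assuming each failed run yields a set of tags:

```python
from collections import Counter

def tag_distribution(runs):
    """runs: list of tag-sets, one per failed task run.
    Returns per-tag frequency; a run carrying two tags counts toward both."""
    counts = Counter(tag for tags in runs for tag in tags)
    return {tag: n / len(runs) for tag, n in counts.most_common()}
```

Because one run can carry several tags, the frequencies can sum to more than 1 — that's intentional, since the combinations are part of the signal.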

What this changed about how I think about coding agents:

Most "agent fails" in benchmark runs aren't reasoning failures. They're procedural failures.

The agent understood the bug, made the right edit, and then either (a) didn't run the tests before finishing or (b) called finish before checking whether the tests actually passed. Those are cheap prompt fixes. Before I had failure tags, I would have looked at a 63% pass rate and assumed the 37% was "tasks the model can't solve." Post-tagging, most of it is "tasks the model can solve if you remind it to check its work."

That's a meaningfully different picture. It means the highest-leverage tuning isn't "make the model smarter" — it's "make the system-prompt scaffold more rigorous about the edit-verify-finish loop."
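Concretely, the cheap fixes are a couple of scaffold lines appended to the agent's instructions. The wording here is illustrative, not swe-agent-lite's actual prompt:

```python
# Illustrative system-prompt scaffold targeting the two most common
# failure tags (edit_but_no_retest and exit_early). Wording is mine.
VERIFY_RULES = (
    "After every edit_file call, call run_tests before doing anything else.\n"
    "Only call finish after run_tests reports all tests passing.\n"
)
```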

What this doesn't catch

The gap from the tradeoff above: trajectory predicates only see tool-call shapes, so a run that fails in a way I haven't enumerated (the deleted-test-file case, say) gets no tag until I write a new predicate for it. The tags tell you how a run failed procedurally, not why the model made the choices it did.

Pattern: what I'd take to any agent benchmark

Three rules from building this:

  1. Tag the trajectory, not just the outcome. A boolean "did tests pass" is nowhere near enough. Minimum viable signal is: did the agent finish cleanly, did it verify, and did any tool call produce an error.
  2. Pick tag names that are accusations, not descriptions. edit_but_no_retest says what went wrong in a way a product-minded reader can act on. "Insufficient test verification coverage" says the same thing in bureaucratic language nobody can debug from.
  3. Ship the tagger as a pure function. No LLM in the categorization loop. Determinism is what makes failure modes comparable across runs; an LLM judge that categorizes inconsistently breaks the whole premise.
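Rule 1's minimum viable signal is a few lines over any tool-call log (field names illustrative, not tied to swe-agent-lite):

```python
def minimum_viable_signal(calls, called_finish):
    """The three booleans rule 1 asks for, from any tool-call log."""
    return {
        "finished_cleanly": called_finish,
        "verified": any(c["tool"] == "run_tests" for c in calls),
        "tool_errors": any("error" in c.get("output", "").lower() for c in calls),
    }
```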

What this is for, honestly

swe-agent-lite isn't a competitor to SWE-bench. It's a smoke test I can run in two minutes after tweaking a system prompt, with enough signal to tell me which direction I moved the agent. That's a useful object, and as far as I can tell, not one that's well-represented in the public research tooling. Most benchmarks are either tiny and useless or enormous and slow. The middle — "small and legible" — is where you actually do iteration.

The failure-mode tags are what make it legible. Without them it's a pass rate. With them, it's a diagnostic report you can act on.