I wanted to do one end-to-end mechanistic-interpretability project that one person could verify from scratch — no Transformer Circuits mystique, no 2000-line frameworks, just a published finding I could reproduce with paper, pencil, and about 300 lines of numpy. I picked the induction-head result from Olsson et al. (Anthropic, 2022): the observation that small transformers develop specific attention heads that implement an in-context-copying circuit, and that ablating those heads tanks in-context learning.

I shipped v0.1 of mech-interp-starter. The README said I had recovered 5/5 of GPT-2 small's published induction heads. Two hours later, I realized my scoring formula was off by one. This post is the math, the bug, why I'm glad I caught it before anyone looked, and what the corrected v0.2 run actually shows.

The setup

Feed the model a doubled random token sequence of length 2N:

[t_0, t_1, ..., t_{N-1}, t_0, t_1, ..., t_{N-1}]
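In numpy, one trial's input can be sketched like this (names and the value of N are illustrative, not the repo's API):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
vocab_size = 50257  # GPT-2's vocabulary size

# first half is random tokens; second half repeats it verbatim
first_half = rng.integers(0, vocab_size, size=N)
tokens = np.concatenate([first_half, first_half])  # length 2N

assert tokens.shape == (2 * N,)
assert np.array_equal(tokens[:N], tokens[N:])
```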

The induction-head circuit works like this: at some position q in the second half, the model has just seen a token it's already seen once before (because the sequence is repeating). A well-formed induction head attends from q to the first-half position that holds the token that originally followed the current one — then the OV circuit copies that token's embedding into the residual stream, which gets read out as the model's prediction for position q + 1.

If the head works, the prediction for q + 1 is exactly the token that actually comes next in the repeat. The prefix-matching score measures how much of a head's attention lands on that canonical position.

The math — step by step

For query position q in the second copy (where N ≤ q < 2N), the token at q is t_{q-N}. The previous occurrence of that same token is at position q - N. The token right after that previous occurrence — the one the induction head should attend to and copy — is at position (q - N) + 1 = q - N + 1.

So the canonical key for query q is q - N + 1.

The prefix-matching score for attention head h:

score_h = mean over q in [N, 2N - 1) of attention_pattern[h, q, q - N + 1]

The upper range is 2N - 1 because the last query position would need a target at N, which is in the second copy — not a meaningful match.
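The whole score fits in a few lines of numpy (a sketch with my own names, not necessarily the repo's):

```python
import numpy as np

def prefix_matching_score(attn, N):
    """attn: (n_heads, 2N, 2N) attention pattern on one doubled sequence.

    Average, over second-half queries q in [N, 2N - 1), the attention
    paid to the canonical induction key q - N + 1. One score per head.
    """
    q = np.arange(N, 2 * N - 1)
    return attn[:, q, q - N + 1].mean(axis=-1)

# sanity check: a perfect induction head scores 1.0
N = 4
attn = np.zeros((1, 2 * N, 2 * N))
q = np.arange(N, 2 * N - 1)
attn[0, q, q - N + 1] = 1.0   # all mass on the canonical key
print(prefix_matching_score(attn, N))   # → [1.]
```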

The bug

My v0.1 code had:

query_positions = np.arange(N + 1, 2 * N)
key_positions   = query_positions - N

Which means the target for query q was q - N, not q - N + 1. I was measuring attention to the previous occurrence of the current token, not to the token right after it.

These are different heads. The first pattern is a "duplicate-token" head — it recognizes that the current token has appeared before. That's a precursor circuit, and some heads do exhibit it, but it's not the full induction pattern. A real induction head does the next step: copying the follow-up token.
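For contrast, the corrected v0.2-style indexing per the derivation above (small N just to show the arrays):

```python
import numpy as np

N = 4  # illustrative only
query_positions = np.arange(N, 2 * N - 1)    # q in [N, 2N - 1)
key_positions   = query_positions - N + 1    # the token AFTER the previous occurrence

print(query_positions)  # [4 5 6]
print(key_positions)    # [1 2 3]
```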

The silent failure mode: my v0.1 unit tests all passed, because I'd constructed them against my own (wrong) scoring formula — they were self-consistent with the bug. The tests enforced nothing about the paper's definition, only that my scoring function agreed with itself.

How I caught it

I didn't catch it from the code. I caught it from the docstring.

I was writing the v0.2 README, pulling the derivation out of my head so I could explain it cleanly. Halfway through the example walk-through — at query position q = N, the model is looking at t_0 again — I noticed my prose said the target was position 1, which is where t_1 lives, not t_0. Meanwhile my code, whose first query is q = N + 1, also said the target was q - N = 1. The two numbers agreed. But why the target was 1 was the question: was it "attend to the previous t_0" or "attend to the token after the previous t_0"?

I went back to the paper. Quote from Olsson et al.:

For each token X_i in the second copy, the prefix matching score is computed as the attention paid by that token to X_{i+1} in the first copy.

X_{i+1}, not X_i. In 0-indexed terms, the query at position N + k should attend to position k + 1, i.e., q - N + 1. My code used q - N. Off by one, exactly as I'd started to suspect.

What the corrected run shows

Published induction heads in GPT-2 small, per multiple Transformer Circuits posts: L5H1, L5H5, L6H9, L7H2, L7H10. Running v0.2 with the corrected formula, n_trials=10, seed=0:

             gpt2 — prefix-matching top 10
┏━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┓
┃ rank ┃ layer.head ┃  score ┃
┡━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━┩
│    1 │ L5H5       │ 0.9271 │  ← published
│    2 │ L7H10      │ 0.8974 │  ← published
│    3 │ L6H9       │ 0.8865 │  ← published
│    4 │ L5H1       │ 0.8555 │  ← published
│    5 │ L7H2       │ 0.8102 │  ← published
│    6 │ L10H7      │ 0.5451 │
│    7 │ L10H1      │ 0.4979 │
│    8 │ L9H9       │ 0.4893 │
│    9 │ L9H6       │ 0.4742 │
│   10 │ L5H0       │ 0.4448 │
└──────┴────────────┴────────┘

recovered 5/5 published induction heads in top-10.

All five published heads in the top five slots, each scoring above 0.81. The gap between rank 5 (0.81) and rank 6 (0.54) is wide enough that the signal isn't ambiguous.

The causal check — ablation

Prefix-matching is descriptive: "this head attends at the right position." It doesn't prove the head is causally responsible for in-context learning. For that, you ablate: zero out the head's contribution and measure how much in-context-learning loss gets worse.

v0.2 ships a context manager that hooks the attention-output projection and zeroes the per-head slice of its input before projection — which is equivalent to removing that head's contribution to the residual stream. ICL loss = cross-entropy on predicting tokens in the second copy (the half where in-context learning kicks in).
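That equivalence is easy to check in numpy: zeroing a head's slice of the projection input subtracts exactly that head's write to the residual stream. A toy check, with made-up shapes (not GPT-2 small's):

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, d_model = 4, 8, 32

head_out = rng.normal(size=(n_heads, d_head))       # per-head attention outputs, one token
W_O = rng.normal(size=(n_heads * d_head, d_model))  # output projection

full = head_out.reshape(-1) @ W_O

ablated_in = head_out.copy()
ablated_in[2] = 0.0                                 # zero head 2's slice before projection
ablated = ablated_in.reshape(-1) @ W_O

# head 2's standalone contribution to the residual stream
contrib = head_out[2] @ W_O[2 * d_head:(2 + 1) * d_head]

assert np.allclose(full - ablated, contrib)
```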

Zeroing the five published induction heads vs a matched null set of non-induction heads at similar layers:

Ablation set                                          Baseline ICL loss   Ablated ICL loss   Δ
L5H1, L5H5, L6H9, L7H2, L7H10 (published induction)   0.7144              5.3056             +642.7%
L8H0, L8H3, L9H0, L9H2, L11H0 (matched null)          0.7144              0.9126             +27.7%

A 23× gap in loss increase. Ablating the induction-head set collapses in-context learning entirely; ablating five matched mid-layer heads produces a modest bump consistent with "any five heads contribute something." This is the specific claim the paper makes: these heads are causally responsible for ICL, not just correlated with it.

Note one sharp caveat: I initially tried the null ablation on early layers (L0, L1, L2). Those produced a +553% loss increase — not because those heads do the same thing as induction heads, but because early-layer attention is a prerequisite (the "previous-token heads" that load the keys). Early-layer heads therefore make for a wrong null. The mid-to-late-layer null (L8, L9, L11) gives the clean comparison.

What I learned about how to ship this kind of work

The bug was cheap to fix and expensive to have shipped un-caught. The tests didn't save me — they hid the bug because I'd written them against the wrong spec. What saved me was trying to explain the math to someone else (the README's imagined reader) and noticing my prose disagreed with my code.

Writing about code has higher fidelity than testing code. Tests check your code against itself; prose checks your code against the world.

Three rules I've internalized from this:

  1. Write the docstring before shipping the claim. If I can't rederive the formula from the paper in my own prose without looking at my code, my code might be right by accident.
  2. Ship a regression-guard test specifically for the wrong version. test_off_by_one_regression_target_q_minus_n feeds attention concentrated on the v0.1 target (q - N) and verifies the score is zero. If anyone ever reverts the formula, that test fails loudly.
  3. Log the bug transparently. v0.2's README and commit message say "v0.1 was wrong, here's how, here's the fix." That's not a weakness to hide; it's the strongest credibility signal a portfolio project can emit. Reviewers who matter know that everyone ships bugs; what they're looking at is how you caught and corrected yours.
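A minimal sketch of rule 2's guard (the scoring helper is inlined so the snippet stands alone; the repo's actual test presumably calls its own API):

```python
import numpy as np

def prefix_matching_score(attn, N):
    # average attention on the canonical key q - N + 1, over q in [N, 2N - 1)
    q = np.arange(N, 2 * N - 1)
    return attn[:, q, q - N + 1].mean(axis=-1)

def test_off_by_one_regression_target_q_minus_n():
    N = 8
    attn = np.zeros((1, 2 * N, 2 * N))
    q = np.arange(N + 1, 2 * N)
    attn[0, q, q - N] = 1.0   # all mass on the v0.1 (wrong) target
    # the wrong pattern must score exactly zero under the corrected formula
    assert prefix_matching_score(attn, N)[0] == 0.0

test_off_by_one_regression_target_q_minus_n()
```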

Caveats I'm not papering over

What's next

v0.3's roadmap: effective OV through layer norm, sweep N for the phase-transition plot, extend to Pythia-160M and Gemma-2-2B. v0.5 starts on SAE feature extraction — the standard next rung of interp work and the one most likely to have a v0.1 bug lurking in it, so I'm holding it for last.

The full code, the three committed run artifacts (prefix-matching, copying, ablation), and the regression-guard test are in the repo. The README says "recovered 5/5 published heads" and the artifact underneath it is real.