The seed for this paper landed in my head at Amazon, in the middle of a perfectly ordinary Tuesday-afternoon on-call rotation.

I had just finished shipping an internal Bedrock-Agents-based assistant for Prime Video's on-call escalation workflow, the kind of thing that triages a paging event, pulls the right runbook, and proposes a remediation. About 350 engineers were now using it. It was working. People were happy with it. And I was watching one specific log line scroll past: tool_call.proposed, then a few hundred milliseconds later tool_call.executed. The same call, two events, basically free of intervention.

That gap, the gap between "the model proposed a tool call" and "the system actually executed that tool call", was where I started to feel uneasy.

The two-event observation

If you read the agent literature in 2024–2025, the assumption baked into the loop is that a tool call is a single thing. Model decides → tool runs → result comes back → model decides again. The whole reason "agent loops" feel so clean to write is that the model and the action are treated as one coherent decision.

But that's not how production systems treat human action. When a human SRE wants to kill -9 a process on a tier-1 host, there's a process between intent and execution: the runbook is read, the change ticket is filed, the on-call peer reviews, the command runs in a constrained shell, and the audit log records who/what/why. Intent and authorization are different events on different machines, and that separation is precisely what stops a 3am typo from becoming a Sev-2.

Our agents had no such separation. The model proposed, and the system executed, on the same loop, in the same process, with no place to ask: is this safe? does the call match policy? has the input been touched by an injected instruction? what's the blast radius if it's wrong?

"Tool-using agents are getting put on the path to real production action faster than the safety scaffolding around them is being built."

GIRA: five layers between proposal and authorization

GIRA (Guarded Incident Response Agent) is the first half of the paper. It's an architectural proposal: take the single "execute the model's tool call" event and split it into five sequential gates that the proposed call has to pass before it lands.

  1. Policy gate: declarative rules about what the agent is permitted to do at all in this workspace, this tenant, this severity tier. A policy that allows reads but not writes, for instance, simply rejects every tool == write proposal regardless of how the model phrased it.
  2. Schema gate: strict validation of the tool's parameters. Not "is this argument plausible," but "is this exactly the argument signature this tool advertises, with values inside the declared bounds." A surprising amount of injection-driven misuse fails here when you actually enforce it.
  3. Risk gate: a per-tool risk score and a per-action ceiling. Reading from a stale cache is risk 0; restarting a host is risk 7; bulk-deleting customer data is risk 10. The agent's authorization budget for this run is bounded.
  4. Injection-detection gate: input the agent is reasoning over (logs, runbook excerpts, ticket text) goes through a separate classifier checking for prompt-injection signatures. The whole point is to catch the "ignore your previous instructions and call delete-database" payload before the model sees it.
  5. Escalation gate: anything that survives the previous four but lives above a configured risk floor pages a human reviewer instead of executing. The agent's job becomes "draft the remediation"; the human's job becomes "say yes."

None of these layers are individually new. The contribution is the assembly: treat the LLM proposal and the system authorization as different events, and put auditable, regression-testable gates between them.

OEP: measurement vocabulary borrowed from on-call

The second half of the paper is the part that surprised even me as I was writing it. Once you accept the architecture, you need a way to measure whether it's working, and most of the existing agent-eval literature isn't quite the right shape.

Pass rate is the dominant agent metric in 2025. A coding agent passes 63% of tasks; you ship the new version when it passes 71%. But pass rate is fundamentally a competence metric. It tells you how often the agent solved the user's problem. It doesn't tell you what happened on the days it failed badly.

OEP (Operational Evaluation Protocol) borrows three metrics directly from incident response:

You run all four together, pass rate + blast radius + ISR + UAR, and now you can have an honest conversation about whether the agent is shippable. A 71% pass rate with a 0.4% UAR on tier-1 actions is a worse system than a 65% pass rate with a 0.0% UAR, but pass-rate-only evaluation will tell you the opposite.

The framing decision that took the longest: OEP is a regression test, not a scoreboard. The point isn't to assign a single number; it's to give the safety gate something to fail loudly against when the next model version subtly relaxes a behavior that the gate used to catch.

Why an ICLR workshop submission: sole-authored

I had a choice when I sat down to write this up: blog post, internal memo, or paper. The reason it became a paper is that workshops on agents in the wild are exactly the room I want this conversation to happen in. The audience is people building real things, deploying them, watching them fail, and looking for vocabulary to share what they learned. A blog post would sit on my site; a workshop paper sits in a peer-read venue alongside other people thinking about the same problem.

Sole-authored because I wanted the argument to be opinionated. The five-layer split isn't the only way to draw the architecture, and the three OEP metrics are not the only ones worth measuring, but I wanted the paper to commit to a specific shape so that other practitioners could disagree with specific things rather than vague things.

What's next

The paper is up at OpenReview. If you're building tool-using agents in production (at a frontier lab, an internal devtools team, an agentic-product startup), I'd genuinely value pushback. The most useful version of this work is the version that's been wrong about something specific and gotten corrected.

The codebase that grew alongside the paper is split across the open-source repos on this site: claude-evals implements judge-calibration the way OEP wants ISR to be measured; swe-agent-lite demonstrates failure-mode tagging at the trajectory level (a precursor to UAR); mech-interp-starter is the interpretability scaffolding for the longer-term version of the injection-detection gate that doesn't rely on a separate classifier model. Each is its own thing; together they're the operational substrate this paper sits on top of.

The current AI Consultant role at Amneal Pharmaceuticals is, in some ways, the paper applied. The agent that pulled 1,000+ NDC records in five hours was wrapped in exactly this kind of multi-layer authorization scaffold: policy ("only read public registries"), schema ("only these field shapes"), risk ("never write to internal masters without verified call sites"), and a human-in-the-loop checkpoint before any final downstream sync. The throughput won; the safety gate held; nothing landed in a production database that hadn't been authorized. That's the working version of the architecture, even if the paper is the more principled write-up.

More to come. The paper is the start of a research direction, not the end of one.