Open to research-engineer roles at frontier AI labs
AI Consultant @ Amneal · ICLR 2026 author · ex-Amazon Prime Video

LLM agent safety: designed, shipped, and published.

Engineer and independent researcher on LLM agent safety. Sole author of an ICLR 2026 workshop paper proposing GIRA, a multi-layer safety gate for tool-using agents, paired with OEP, an SRE-grounded eval protocol. Currently AI Consultant at Amneal Pharmaceuticals; previously SDE II at Amazon Prime Video (LLM systems for a 200M+ user catalog) and founding engineer at Aspecta.ai. On the side: 11 open-source repos around Claude (interpretability, evals, agents, retrieval), each shipped with committed run artifacts and writeups.

STEM OPT · H-1B sponsorship required · cap-exempt eligible
ICLR '26
Workshop paper · sole author
200M+
Users impacted at Amazon
5 / 5
GPT-2 induction heads recovered
11
Shipped open-source repos
Publications

A paper on guarded agents.

My ICLR 2026 workshop submission proposing a safety architecture for tool-using LLM agents, plus an SRE-grounded eval protocol for measuring it.

ICLR 2026 Workshop · Agents in the Wild · Sole author · Submission

Guarded Tool-Using LLM Agents for Incident Response: A Safety-Gated Architecture and Operational Evaluation Protocol

Dhruv Patel · independent

Tool-using LLM agents are increasingly being put on the path to real production action (file mutations, API calls, infrastructure changes) at the same time as we are still discovering the prompt-injection and over-trust failure modes that come with them. This paper makes two contributions toward closing that gap.

GIRA (Guarded Incident Response Agent) is a multi-layer safety gate that separates LLM proposal from action authorization. Five layers (policy, schema, risk, injection detection, and human-in-the-loop escalation) sit between the model's tool call and any state-changing side-effect, so a successful injection or hallucinated action gets caught before it lands in production.
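A minimal sketch of how a gate of this shape could be wired, not the paper's implementation. Every name here is hypothetical (`ToolCall`, `Verdict`, the per-layer functions, the tool allowlist, the injection markers), and the keyword-based injection check is a toy stand-in for a real detector.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ESCALATE = "escalate"  # needs a human decision before anything runs


@dataclass
class ToolCall:
    tool: str
    args: dict
    context: str = ""  # untrusted text the model saw before proposing the call


# Hypothetical policy data: allowed tools and the argument names each accepts.
ALLOWED_TOOLS = {"get_logs": {"service"}, "restart_service": {"service"}}
HIGH_RISK_TOOLS = {"restart_service"}
INJECTION_MARKERS = ("ignore previous instructions", "disregard the above")


def policy_layer(call: ToolCall) -> Verdict:
    return Verdict.ALLOW if call.tool in ALLOWED_TOOLS else Verdict.DENY


def schema_layer(call: ToolCall) -> Verdict:
    # Runs after policy_layer, so call.tool is known to be in ALLOWED_TOOLS.
    ok = set(call.args) == ALLOWED_TOOLS[call.tool]
    return Verdict.ALLOW if ok else Verdict.DENY


def risk_layer(call: ToolCall) -> Verdict:
    # State-changing tools are never auto-approved in this sketch.
    return Verdict.ESCALATE if call.tool in HIGH_RISK_TOOLS else Verdict.ALLOW


def injection_layer(call: ToolCall) -> Verdict:
    text = call.context.lower()
    hit = any(marker in text for marker in INJECTION_MARKERS)
    return Verdict.DENY if hit else Verdict.ALLOW


def gate(call: ToolCall, approve: Callable[[ToolCall], bool]) -> Verdict:
    """Run the four automated layers, then the human layer if one asked for it."""
    needs_human = False
    for layer in (policy_layer, schema_layer, risk_layer, injection_layer):
        verdict = layer(call)
        if verdict is Verdict.DENY:
            return Verdict.DENY  # any hard denial ends the chain
        if verdict is Verdict.ESCALATE:
            needs_human = True  # keep checking; a later layer may still deny
    if needs_human:
        return Verdict.ALLOW if approve(call) else Verdict.DENY
    return Verdict.ALLOW
```

The design choice worth noting is that the model only ever produces a `ToolCall`; the side-effect runs only if `gate` returns `Verdict.ALLOW`. An escalation is deferred rather than returned immediately, so an injected request for a high-risk tool is denied outright instead of reaching the human approver.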

OEP (Operational Evaluation Protocol) is the SRE-grounded eval scaffolding that measures the architecture itself rather than just headline pass rates: blast radius, injection success rate (ISR), and unauthorized action rate (UAR), alongside conventional task success. The safety gate can be regression-tested as the agent is updated.
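As a rough illustration of how these rates could be computed from logged episodes, here is a sketch under assumed definitions: ISR as the share of injection attempts whose intended action executed, and UAR as the share of executed actions that were unauthorized. The `Episode` fields and both definitions are my assumptions for this example; the paper's exact formulations may differ, and blast radius is left out because it needs a resource model.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    injected: bool             # did the episode contain an injection attempt?
    injection_succeeded: bool  # did the attacker's intended action execute?
    actions_executed: int
    unauthorized_actions: int
    task_succeeded: bool


def oep_summary(episodes: list[Episode]) -> dict[str, float]:
    """Aggregate task success, ISR, and UAR over a batch of episodes."""
    injected = [e for e in episodes if e.injected]
    total_actions = sum(e.actions_executed for e in episodes)
    isr = sum(e.injection_succeeded for e in injected) / len(injected) if injected else 0.0
    uar = sum(e.unauthorized_actions for e in episodes) / total_actions if total_actions else 0.0
    return {
        "task_success": sum(e.task_succeeded for e in episodes) / len(episodes),
        "isr": isr,
        "uar": uar,
    }


def regressed(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Names of safety metrics that got worse than the stored baseline."""
    return [m for m in ("isr", "uar") if current[m] > baseline[m]]
```

A CI job could store the summary from the last accepted agent version and fail the build whenever `regressed` returns a non-empty list.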

01
GIRA architecture
Five-layer safety gate (policy · schema · risk · injection · escalation) separating proposal from authorization.
02
OEP protocol
SRE-grounded eval surface: blast radius + ISR + UAR alongside task success, regression-testable as agents update.
03
Operational lens
Frames agent safety in vocabulary on-call engineers already use, so the gate can sit on a real production runbook.
Cite
@inproceedings{patel2026guarded,
  title    = {Guarded Tool-Using {LLM} Agents for Incident Response:
              A Safety-Gated Architecture and Operational Evaluation Protocol},
  author   = {Patel, Dhruv},
  booktitle= {ICLR 2026 Workshop on Agents in the Wild},
  year     = {2026},
  url      = {https://openreview.net/forum?id=LBt5eX6OKx}
}
Experience

Where I've shipped.

Six years across pharma AI consulting, Amazon Prime Video, a Series A developer-platform startup, an India-based ML engineering role, and Indian Army defense R&D. Each role a step up in scope, complexity, and stakes.

Amneal Pharmaceuticals · Current

2026-Present
AI Consultant
Long Island, NY
  • Claude Code research agent for NDC enrichment. Built an agent that pulled 1,000+ NDCs with their attributes from external pharmaceutical data sources in ~5 hours. A manual offshore effort on the same scope had covered ~250 in 6 months. Standardized the workflow into a reusable internal tool.
  • Enterprise AI & data pipelines. Building AI, data analytics, and master data management pipelines spanning Amneal's pharmaceutical product systems.
  • Direct application of GIRA-style guards. Where the agent acts on internal databases, layering policy + schema + risk gates before a write operation lands. The ICLR paper grew out of exactly this kind of production constraint.

Amazon

2025-2026
Software Development Engineer II
Prime Video CoreTech · New York, NY
  • LLM-powered classification engine. Designed and shipped the micro-genre system for Prime Video's storefront. Personalized title carousels for a 200M+ user catalog, with cross-team coordination across product, design, and ML platform.
  • Real-time ML pipelines. Built on AWS SageMaker, DynamoDB, S3, and EMR. ~30% latency reduction in benchmarked runs; load-tested across IAD, PDX, DUB, ZAZ regions using CloudWatch, ALBs, and CloudFormation.
  • Bedrock Agents on-call assistant. Built an internal Amazon Bedrock Agents-based assistant for the on-call escalation workflow, used by 350+ Prime Video engineers. First production exposure to tool-using agent safety, which directly informed my later GIRA work.

Aspecta.ai

2024-2025
Founding Engineer (IC + PM)
Early-stage startup · Santa Clara, CA
  • 0 → 400K users. Directed cross-functional teams spanning engineering, product, marketing, and finance. Scaled the developer ecosystem 1000× within 12 months.
  • Global hackathon leadership. Launched flagship event with 1,100+ participants and $50K+ prize pool. 110+ project submissions; directly influenced 2 product features shipped within 6 months.
  • Enterprise partnerships. Established strategic relationships with Google, Amazon, Microsoft and leading AI startups. 40% increase in developer engagement through co-developed initiatives.
  • Technical workshops. Built and facilitated 8+ workshops on GenAI and Web3. Produced comprehensive runbooks and partner enablement materials.

NBVP Technologies

2019-2022
Software Engineer
Surat, Gujarat, India
  • Responsive UI for an internal recruiting product. Mobile, tablet, and desktop support with 95%+ browser compatibility. User-satisfaction scores moved from 6.2 → 8.9 across 150+ surveys.
  • Real-time interview status via WebSocket. Live updates that cut page-refresh needs by 85% and lifted candidate-experience scores by 42%.
  • Redis caching strategy. Session management and hot-path caching reduced database queries by 55% and improved page load by 45%.

Indian Army Defense R&D

2019
AI / ML Engineer
Indore, Madhya Pradesh, India
  • Real-time image processing system. Python + OpenCV with multi-threading. 120 FPS throughput (60× over baseline) on 1920×1080 video streams at <100ms latency.
  • Distributed processing pipeline. Multiprocessing + message queues across 8 CPU cores; 6.5× speedup, with the throughput characteristics needed for real-time surveillance.
Open source

Eleven repos, actually shipped.

Weekends and evenings, in public. Every repo has CI, tests, and committed run artifacts you can read before installing. Grouped by what signal they send for frontier-lab hiring.

Safety, evals & interpretability

04 · v0.2
mech-interp-starter v0.2

Reproducing Olsson et al. (Anthropic, 2022)

Prefix-matching + copying score + head ablation on GPT-2 small. NumPy-only scoring math with an explicit off-by-one regression guard. Caught and transparently logged a bug in my own v0.1 instead of shipping the wrong result.

5 / 5
Published heads recovered in top-10
+642%
ICL loss under induction-head ablation
python · pytorch · transformers · numpy
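For readers unfamiliar with the metric, a small NumPy sketch of a prefix-matching score on a twice-repeated random block shows where an off-by-one can creep in. This is an illustration under my own assumptions (the function name, the `(seq, seq)` layout with rows as query positions), not the repo's code, and not necessarily the bug fixed in v0.1.

```python
import numpy as np


def prefix_matching_score(attn: np.ndarray, period: int) -> float:
    """Mean attention from each second-half query to its induction target.

    attn:   (seq, seq) attention pattern for one head, rows = query positions.
    period: length of the random block; the sequence is that block twice.
    """
    seq = attn.shape[0]
    assert attn.shape == (seq, seq) and seq == 2 * period
    queries = np.arange(period, seq)
    # The token at query i last appeared at i - period. An induction head
    # attends to the token AFTER that occurrence, which is i - period + 1.
    targets = queries - period + 1
    return float(attn[queries, targets].mean())
```

Dropping the `+ 1` scores attention to the previous copy of the current token instead of the token that followed it, which is a different circuit. A regression guard can pin this with two synthetic patterns: one that attends to `i - period + 1` (score 1.0) and one that attends to `i - period` (score 0.0).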
claude-evals v0.2

Safety eval suite with a calibrated judge

Sycophancy, refusal calibration (XSTest-style), jailbreak robustness. 52 hand-curated cases. v0.2 ships a hand-labeled gold set so you can measure the judge's accuracy before trusting any subject pass rate.

52
Hand-curated eval cases
31
Gold verdicts for judge calibration
python · claude API · pydantic · messages.parse
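The calibration step reduces to comparing judge verdicts against the gold labels before reading any subject-model number. A minimal sketch, assuming verdicts are plain strings such as "pass" and "fail" (the function name and label values are illustrative, not the repo's schema):

```python
from collections import Counter


def judge_calibration(gold: list[str], judged: list[str]) -> dict:
    """Accuracy of the judge on a hand-labeled gold set, plus a confusion count."""
    assert gold and len(gold) == len(judged)
    agree = sum(g == j for g, j in zip(gold, judged))
    # Keys are (gold_label, judge_label) pairs; off-diagonal keys are errors.
    confusion = Counter(zip(gold, judged))
    return {"accuracy": agree / len(gold), "confusion": confusion}
```

The confusion counts matter as much as the headline accuracy: a judge that errs mostly toward "pass" inflates every subject pass rate, while one that errs toward "fail" deflates them.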
swe-agent-lite v0.2

A tiny SWE-bench, with failure modes

13 curated bug-fix tasks (3 multi-file), 5-tool agent surface, sandboxed workspace. Per-task failure-mode tags (hit_iteration_limit, edit_but_no_retest, exit_early) so you know why a task failed, not just that it did.

13
Curated bug-fix tasks
3
Multi-file (incl. 1 hard)
python · tool-use · pytest · sandboxed
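To make the tagging idea concrete, here is a sketch that derives tags from the ordered list of tool names a failed run used. The tag names come from the project description; the tool names (`read_file`, `edit_file`, `run_tests`) and the rule behind each tag are assumptions for illustration, not the repo's actual logic.

```python
def tag_failure(steps: list[str], max_iterations: int) -> list[str]:
    """Assign failure-mode tags to a failed run, given its tool-call sequence."""
    tags = []
    if len(steps) >= max_iterations:
        tags.append("hit_iteration_limit")
    if "run_tests" not in steps:
        tags.append("never_ran_tests")
    elif "edit_file" in steps:
        # Index of the last occurrence of each tool in the sequence.
        last_edit = len(steps) - 1 - steps[::-1].index("edit_file")
        last_test = len(steps) - 1 - steps[::-1].index("run_tests")
        if last_edit > last_test:
            tags.append("edit_but_no_retest")
    if len(steps) < max_iterations and "edit_file" not in steps:
        # Stopped with budget left and without attempting a fix.
        tags.append("exit_early")
    return tags
```

A run can carry several tags at once, so aggregate counts per tag rather than treating the tags as mutually exclusive categories.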
prompt-gym v0.1

Regression tests for LLM prompts

YAML specs, four matchers (exact / contains / regex / llm_judge). Non-zero exit codes on failure and regression so it drops into CI as a gate.

python · yaml · typer · ci-gate
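The three deterministic matchers and the exit-code contract can be sketched in a few lines. This is an illustration, not the tool's source: cases are plain dicts instead of parsed YAML, the field names (`name`, `matcher`, `expected`) are assumed, and the `llm_judge` matcher is left out because it needs a model call.

```python
import re


def matches(kind: str, expected: str, output: str) -> bool:
    """Apply one matcher to a model output."""
    if kind == "exact":
        return output.strip() == expected.strip()
    if kind == "contains":
        return expected in output
    if kind == "regex":
        return re.search(expected, output) is not None
    raise ValueError(f"unsupported matcher: {kind}")


def run_suite(cases: list[dict], outputs: dict[str, str]) -> int:
    """Return a process exit code: 0 if every case passes, 1 otherwise."""
    failed = [
        case["name"]
        for case in cases
        if not matches(case["matcher"], case["expected"], outputs[case["name"]])
    ]
    for name in failed:
        print(f"FAIL {name}")
    return 1 if failed else 0
```

Passing the return value to `sys.exit` is what lets a CI runner treat a prompt regression like any other failing test step.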

Retrieval & agent tooling

05 · v0.6+
personal-rag v0.7

Local RAG with hybrid retrieval & Contextual Retrieval

LanceDB + fastembed, hybrid BM25 / dense retrieval via RRF, Anthropic Contextual Retrieval, inline citations, watch-mode reindex, server-rendered web UI, and a similar feature for finding related chunks.

python · lancedb · fastembed · BM25 · fastapi
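Reciprocal rank fusion (RRF) merges the BM25 and dense result lists using ranks only, so the two score scales never need to be normalized against each other. A minimal sketch, assuming each retriever returns an ordered list of chunk ids and using the commonly cited constant k = 60 (the function name and the tie-break rule are my own choices):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, `rrf([["a", "b", "c"], ["b", "c", "a"]])` puts `b` first, because it sits near the top of both lists, followed by `a` and then `c`.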
mcp-zettel v0.6

MCP server for a personal Zettelkasten

[[wiki-link]] graph, keyword + semantic search, prompt templates, Mermaid diagrams, and hybrid link suggestions on note create via RRF fusion of keyword and semantic ranks.

python · MCP · fastembed · RRF
claude-pr-reviewer v0.6

Inline PR review comments via gh api

CLI + GitHub Action. Per-file chunking for large diffs, .claude-review.yml repo config, diff-hash review cache, and a calibration harness that measures precision / recall against labelled finding sets.

python · claude API · unidiff · github actions
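The calibration measurement can be pictured as set overlap between the reviewer's findings and a labelled set. A sketch under one simplifying assumption: a finding is identified by its `(file, line)` pair and must match exactly, which the real harness may well relax.

```python
def precision_recall(
    predicted: set[tuple[str, int]], labelled: set[tuple[str, int]]
) -> tuple[float, float]:
    """Precision and recall of predicted review findings against a labelled set."""
    true_positives = len(predicted & labelled)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(labelled) if labelled else 0.0
    return precision, recall
```

Low precision means noisy comments that reviewers learn to ignore; low recall means real issues slip through, so the two are worth tracking separately rather than collapsing them into one score.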
agent-interviewer v0.6

Mock interview CLI with four Claude personas

Behavioral, system-design, coding, case. Per-dimension scoring grounded in the transcript, YAML persona packs, replay through a different model, side-by-side diff between feedback variants in a small web viewer.

python · claude API · fastapi · pydantic
paper-digest v0.6

arXiv / OpenReview / ACL → structured summary

Problem · method · results · limitations. Interactive follow-up Q&A grounded in the paper, reading-list batch mode, watch-a-folder auto-digest, and searchable history.

python · claude API · pypdf · watchdog

Trading & MLOps

02 · v0.1
algo-trader v0.1

Paper-first algorithmic trading

Options wheel + mean-reversion strategies on ETFs. Walk-forward backtests with realistic slippage and costs, risk and compliance gates on every order, FastAPI + Next.js 15 control plane.

python · polars · fastapi · next.js · postgres
mlops-customer-support v0.1

End-to-end ML pipeline for support analytics

PostgreSQL / MongoDB / Qdrant / Redis. HuggingFace sentiment / topic / NER. Prometheus + Grafana + Evidently for drift detection.

python · huggingface · prometheus · grafana · docker
Projects

Earlier applied work.

Three self-directed builds predating the open-source portfolio above, each one shipped end-to-end with measured outcomes.

LLM TutorBot · pilot

Personalized GenAI tutoring

Serverless tutoring assistant with domain fine-tuning and semantic-similarity routing (BERT). Pilot deployment measured user-side satisfaction and engagement.

95%
User satisfaction
2.4×
Engagement lift in pilots
huggingface · pytorch · AWS Lambda · BERT · CUDA
AI Social Pilot · shipped

SMB content automation

React Native platform combining Supabase + n8n + Claude Vision. Extracts brand identity from uploaded assets and auto-generates social captions + visuals with a swipe-approval flow for non-technical founders.

90%
Faster content creation
Posting efficiency
react native · supabase · n8n · claude vision
RAGWorks · production

Enterprise document Q&A at 5M-doc scale

Retrieval-augmented QA service with caching and cross-encoder reranking. Benchmarked on a 5M-document corpus. The token-spend and latency wins came from the rerank + cache layer, not the base retrieval.

-58%
Token spend
+19%
NDCG@10
fastapi · faiss · pgvector · openai · cross-encoder
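For reference, NDCG@10 rewards placing relevant documents near the top of the ranked list. A compact sketch using linear gain, with the ideal ordering computed from the judged list that is passed in; both are simplifying assumptions of this example and not a description of how the RAGWorks benchmark was run.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` holds the graded relevance of each returned document in ranked order. A reranker improves the score by moving high-relevance items toward index 0, where the logarithmic discount is smallest.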
Writing

Four essays on what I built.

The story behind the ICLR paper, and long-form writeups on the mech-interp reproduction, the calibrated safety-eval judge, and why pass rates aren't enough for coding-agent benchmarks. Writing is how I think; these are the technical decisions I'd want a reviewer to see.

interpretability · 11 min read

Reproducing induction heads in GPT-2, and the bug I caught in my own v0.1

I shipped a prefix-matching score, then realized my formula was off by one. Here's the math derivation, how I caught it, why publishing the wrong result would have been worse than publishing nothing, and what the corrected run shows: 5/5 published heads recovered, +642% ICL-loss from targeted ablation.

Read the essay
safety evals · 9 min read

Why your LLM judge might be wrong, and how to measure it

"Claude judges Claude" is the default setup for safety evals and the most obvious critique. The fix isn't to abandon LLM judges; it's to ship a hand-labeled gold set, run the judge on it, and publish its accuracy alongside every subject-model number. Here's the design and honest limits.

Read the essay
agents · 8 min read

Pass rates aren't enough: failure-mode tags for coding agents

When a coding agent fails a task, "63% pass rate" tells you nothing about why it failed. I built an 8-tag categorization (hit_iteration_limit, edit_but_no_retest, never_ran_tests, exit_early, and so on). Here's how that changed my priors on which failures were worth fixing first.

Read the essay
Community & leadership

Building developer ecosystems.

Scaling dev communities from hundreds to hundreds of thousands, organizing global hackathons, and mentoring the next wave of ML engineers.

40K+

Developer community

Built and scaled the Aspecta.ai developer ecosystem to 40,000+ active builders worldwide through 8+ hands-on workshops across GenAI and Web3.

$50K

Global hackathon

Launched flagship event with 1,100+ participants, 110+ project submissions, and a $50K prize pool. Two features shipped back into the product within six months.

350+

Engineers served

Internal on-call agent I shipped via Amazon Bedrock Agents automates escalation workflows for 350+ Prime Video engineers.

8+

Technical workshops

GenAI and Web3 deep-dives at Aspecta.ai with runbooks, partner enablement docs, and curriculum used by hundreds of developers.

GDG

Google Developer Groups

Organizer. Technical meetups and mentorship on ML systems, production deployment, and applied AI for early-career engineers.

MLH

Major League Hacking

Judge at national-level student hackathons. Evaluate technical depth, project execution, and presentation craft across hundreds of student teams.

About

How I actually work.

I'm an engineer and independent researcher working on LLM agent safety. The thread that ties the day job, the paper, and the open-source repos together is the same: tool-using agents are getting put on the path to real production action faster than the safety scaffolding around them is being built. I'm trying to close that gap with code, with measurement, and with writing.

Right now I'm an AI Consultant at Amneal Pharmaceuticals, building Claude-Code-based research agents that pull pharmaceutical product data from external sources at orders of magnitude higher throughput than the manual baseline, with the safety gates wrapped tight enough that an agent never writes anything to internal databases without a verified call site. Before that, I was an SDE II at Amazon Prime Video CoreTech, where I shipped LLM systems for a 200M+ user catalog and a Bedrock Agents on-call assistant for 350+ engineers. That production exposure became the seed for the ICLR paper.

The ICLR 2026 workshop paper (Agents in the Wild, sole author) proposes GIRA, a five-layer safety gate that separates LLM proposal from action authorization, paired with OEP, an SRE-grounded eval protocol that measures blast radius, injection success rate, and unauthorized action rate. The eleven open-source repos either feed into that work (claude-evals, swe-agent-lite, mech-interp-starter) or surround it (retrieval and agent tooling I keep finding myself wanting day-to-day).

I write code that admits its failure modes: every repo ships with committed run artifacts so you can read the headline claim before installing, and the mech-interp repo has an explicit regression guard for the off-by-one I caught in my own v0.1. Currently interviewing for research-engineer and frontier-engineer roles at AI labs. Based in New York, open to relocation. STEM OPT, H-1B cap-exempt eligible.

Education

Where I learned.

Rutgers University

New Brunswick, NJ
M.S. in Computer Science
GPA 3.9 / 4.0

Charusat University

Gujarat, India
B.S. in Computer Science & Engineering
GPA 3.7 / 4.0
Stack

What I reach for.

The tools I actually ship with, split between production work at Amazon and the open-source portfolio.

Languages
Python 3.12 TypeScript Java SQL
LLM / ML
Claude API Bedrock Agents messages.parse Prompt caching MCP fastembed HuggingFace PyTorch Polars
Cloud / Backend
AWS SageMaker DynamoDB S3 / EMR FastAPI Next.js Docker PostgreSQL LanceDB · Qdrant · Redis · MongoDB
Ops & tooling
Airflow GitHub Actions Prometheus Grafana Evidently uv · pytest Typer · Rich
Contact

Let's make something verifiable.

Best path in: email. I reply within a day for role conversations; same for interesting open-source collaborations.