Open to research-engineer roles at frontier AI labs

AI Consultant @ Amneal · ICLR 2026 author · ex-Amazon Prime Video

LLM agent safety: designed, shipped, and published.

Engineer and independent researcher on LLM agent safety. Sole author of an ICLR 2026 workshop paper proposing GIRA, a multi-layer safety gate for tool-using agents, paired with OEP, an SRE-grounded eval protocol. Currently AI Consultant at Amneal Pharmaceuticals; previously SDE II at Amazon Prime Video (LLM systems for a 200M+ user catalog) and founding engineer at Aspecta.ai. On the side: 11 open-source repos around Claude (interpretability, evals, agents, retrieval), each shipped with committed run artifacts and writeups.

STEM OPT · H-1B sponsorship required · cap-exempt eligible

ICLR '26

Workshop paper · sole author

200M+

Users impacted at Amazon

5 / 5

GPT-2 induction heads recovered

11

Shipped open-source repos

Read the paper See the open source

[Figure: GPT-2 prefix-match score, live result. Legend: low · partial · induction head]

Publications

A paper on guarded agents.

My ICLR 2026 workshop submission proposing a safety architecture for tool-using LLM agents, plus an SRE-grounded eval protocol for measuring it.

ICLR 2026 Workshop · Agents in the Wild · Sole author · Submission

Guarded Tool-Using LLM Agents for Incident Response: A Safety-Gated Architecture and Operational Evaluation Protocol

Dhruv Patel · independent

Tool-using LLM agents are increasingly being put on the path to real production action (file mutations, API calls, infrastructure changes) while we are still discovering the prompt-injection and over-trust failure modes that come with them. This paper makes two contributions toward closing that gap.

GIRA (Guarded Incident Response Agent) is a multi-layer safety gate that separates LLM proposal from action authorization. Five layers (policy, schema, risk, injection detection, and human-in-the-loop escalation) sit between the model's tool call and any state-changing side-effect, so a successful injection or hallucinated action gets caught before it lands in production.
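
The separation of proposal from authorization can be sketched in a few lines of Python. This is a minimal illustration of the layering idea, not the paper's implementation: the tool names, the risk threshold, the injection markers, and the order in which the layers run are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict
    context: str = ""  # untrusted text the model read before proposing the call


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    layer: str   # layer that made the final decision
    reason: str


# Illustrative configuration (assumed, not taken from the paper).
TOOL_SCHEMAS = {"read_logs": {"service"}, "restart_service": {"service"}}
HIGH_RISK_TOOLS = {"restart_service"}
INJECTION_MARKERS = ("ignore previous instructions", "disregard the above")


def policy_layer(call: ToolCall) -> Optional[Verdict]:
    if call.tool not in TOOL_SCHEMAS:
        return Verdict(False, "policy", f"tool {call.tool!r} is not allowlisted")
    return None


def schema_layer(call: ToolCall) -> Optional[Verdict]:
    # Safe to index: the policy layer has already confirmed the tool is known.
    if set(call.args) != TOOL_SCHEMAS[call.tool]:
        return Verdict(False, "schema", "arguments do not match the tool schema")
    return None


def injection_layer(call: ToolCall) -> Optional[Verdict]:
    text = call.context.lower()
    if any(marker in text for marker in INJECTION_MARKERS):
        return Verdict(False, "injection", "context contains an injection marker")
    return None


def risk_layer(call: ToolCall) -> int:
    # Toy risk score: state-changing tools score high, read-only tools score zero.
    return 2 if call.tool in HIGH_RISK_TOOLS else 0


def authorize(call: ToolCall, human_approves: Callable[[ToolCall], bool]) -> Verdict:
    """Run a proposed call through every layer; only an allowed Verdict may execute."""
    for layer in (policy_layer, schema_layer, injection_layer):
        verdict = layer(call)
        if verdict is not None:
            return verdict
    if risk_layer(call) >= 2:  # high risk: escalate to a human reviewer
        approved = human_approves(call)
        return Verdict(approved, "escalation",
                       "human approved" if approved else "human declined")
    return Verdict(True, "gate", "all layers passed")
```

The model only ever produces a `ToolCall`; the executor acts only on a `Verdict` with `allowed=True`, so a hallucinated or injected call is stopped before it has side effects.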

OEP (Operational Evaluation Protocol) is the SRE-grounded eval scaffolding that measures the architecture itself rather than just headline pass rates: blast radius, injection success rate (ISR), and unauthorized action rate (UAR), alongside conventional task success. The safety gate can be regression-tested as the agent is updated.
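
As a rough illustration of how these metrics could be aggregated over a batch of evaluation episodes, here is a short Python sketch. The `Episode` fields and the exact definitions (ISR as successful injections over attempted injections, UAR as unauthorized actions over executed actions, blast radius as a count of touched resources) are assumptions for the example; the paper's definitions may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    task_succeeded: bool
    injection_attempted: bool
    injection_succeeded: bool  # the injected action actually executed
    actions_executed: int
    unauthorized_actions: int  # executed without a valid authorization
    resources_touched: int     # crude proxy for blast radius


def ratio(numerator: float, denominator: float) -> float:
    return numerator / denominator if denominator else 0.0


def summarize(episodes: list[Episode]) -> dict:
    attempted = [e for e in episodes if e.injection_attempted]
    return {
        "task_success": ratio(sum(e.task_succeeded for e in episodes), len(episodes)),
        "isr": ratio(sum(e.injection_succeeded for e in attempted), len(attempted)),
        "uar": ratio(sum(e.unauthorized_actions for e in episodes),
                     sum(e.actions_executed for e in episodes)),
        "max_blast_radius": max((e.resources_touched for e in episodes), default=0),
    }


def safety_regressed(baseline: dict, candidate: dict, tolerance: float = 0.0) -> bool:
    """True if any safety metric got worse than the baseline by more than `tolerance`."""
    return any(candidate[m] > baseline[m] + tolerance
               for m in ("isr", "uar", "max_blast_radius"))
```

Running `safety_regressed` against a stored baseline report after each agent update is one way to turn the metrics into a regression check.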

GIRA architecture

Five-layer safety gate (policy · schema · risk · injection · escalation) separating proposal from authorization.

OEP protocol

SRE-grounded eval surface: blast radius + ISR + UAR alongside task success, regression-testable as agents update.

Operational lens

Frames agent safety in vocabulary on-call engineers already use, so the gate can sit on a real production runbook.

Read on OpenReview Google Scholar ↗

Cite

@inproceedings{patel2026guarded,
  title    = {Guarded Tool-Using {LLM} Agents for Incident Response:
              A Safety-Gated Architecture and Operational Evaluation Protocol},
  author   = {Patel, Dhruv},
  booktitle= {ICLR 2026 Workshop on Agents in the Wild},
  year     = {2026},
  url      = {https://openreview.net/forum?id=LBt5eX6OKx}
}

Experience

Where I've shipped.

Six years across pharma AI consulting, Amazon Prime Video, a Series A developer-platform startup, an India-based ML engineering role, and Indian Army defense R&D. Each role a step up in scope, complexity, and stakes.

Amneal Pharmaceuticals Current

2026-Present

AI Consultant

Long Island, NY

Claude Code research agent for NDC enrichment. Built an agent that pulled 1,000+ NDCs with their attributes from external pharmaceutical data sources in ~5 hours; a manual offshore effort had covered ~250 NDCs of the same scope in 6 months. Standardized the workflow into a reusable internal tool.
Enterprise AI & data pipelines. Building AI, data analytics, and master data management pipelines spanning Amneal's pharmaceutical product systems.
Direct application of GIRA-style guards. Where the agent acts on internal databases, I layer policy + schema + risk gates before a write operation lands. The ICLR paper grew out of exactly this kind of production constraint.

Amazon

2025-2026

Software Development Engineer II

Prime Video CoreTech · New York, NY

LLM-powered classification engine. Designed and shipped the micro-genre system for Prime Video's storefront. Personalized title carousels for a 200M+ user catalog, with cross-team coordination across product, design, and ML platform.
Real-time ML pipelines. Built on AWS SageMaker, DynamoDB, S3, and EMR. ~30% latency reduction in benchmarked runs; load-tested across IAD, PDX, DUB, ZAZ regions using CloudWatch, ALBs, and CloudFormation.
Bedrock Agents on-call assistant. Built an internal Amazon Bedrock Agents-based assistant for the on-call escalation workflow, used by 350+ Prime Video engineers. First production exposure to tool-using agent safety, which directly informed my later GIRA work.

Aspecta.ai

2024-2025

Founding Engineer (IC + PM)

Early-stage startup · Santa Clara, CA

0 → 400K users. Directed cross-functional teams spanning engineering, product, marketing, and finance. Scaled the developer ecosystem 1000× within 12 months.
Global hackathon leadership. Launched flagship event with 1,100+ participants and $50K+ prize pool. 110+ project submissions; directly influenced 2 product features shipped within 6 months.
Enterprise partnerships. Established strategic relationships with Google, Amazon, Microsoft and leading AI startups. 40% increase in developer engagement through co-developed initiatives.
Technical workshops. Built and facilitated 8+ workshops on GenAI and Web3. Produced comprehensive runbooks and partner enablement materials.

NBVP Technologies

2019-2022

Software Engineer

Surat, Gujarat, India

Responsive UI for an internal recruiting product. Mobile, tablet, and desktop support with 95%+ browser compatibility. User-satisfaction scores moved from 6.2 → 8.9 across 150+ surveys.
Real-time interview status via WebSocket. Live updates that cut page-refresh needs by 85% and lifted candidate-experience scores by 42%.
Redis caching strategy. Session management and hot-path caching reduced database queries by 55% and improved page load by 45%.

Indian Army Defense R&D

2019

AI / ML Engineer

Indore, Madhya Pradesh, India

Real-time image processing system. Python + OpenCV with multi-threading. 120 FPS throughput (60× over baseline) on 1920×1080 video streams at <100ms latency.
Distributed processing pipeline. Multiprocessing + message queues across 8 CPU cores; 6.5× speedup, with the throughput characteristics needed for real-time surveillance.

Open source

Eleven repos, actually shipped.

Weekends and evenings, in public. Every repo has CI, tests, and committed run artifacts you can read before installing. Grouped by what signal they send for frontier-lab hiring.

Safety, evals & interpretability

04 · v0.2

mech-interp-starter v0.2

Reproducing Olsson et al. (Anthropic, 2022)

Prefix-matching + copying score + head ablation on GPT-2 small. Numpy-only scoring math with an explicit off-by-one regression guard. Caught and transparently logged a bug in my own v0.1 instead of shipping the wrong result.
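
For readers unfamiliar with the metric, here is a numpy-only sketch of a prefix-matching score. It assumes the common setup in which a random token block of length `period` is repeated once and the score is the average attention from each token in the second copy to the token that followed its earlier occurrence. The function name and signature are hypothetical and the repo's actual code may differ.

```python
import numpy as np


def prefix_match_score(attn: np.ndarray, period: int) -> float:
    """Score one head's attention pattern on a sequence of two repeated blocks.

    attn has shape (seq, seq), rows are query positions, and seq == 2 * period.
    """
    seq = attn.shape[0]
    if attn.shape != (seq, seq) or seq != 2 * period:
        raise ValueError("expected a square pattern over exactly two repeated blocks")
    queries = np.arange(period, seq)  # positions in the second (repeated) block
    # The induction target is the token AFTER the earlier occurrence, hence the +1.
    # Dropping the +1 scores attention to the earlier occurrence itself, which is
    # the kind of off-by-one a regression test should pin down.
    targets = queries - period + 1
    return float(attn[queries, targets].mean())
```

A perfect induction head scores 1.0 under this definition, while a head that attends to the earlier occurrence itself scores 0.0, which makes the off-by-one easy to assert on.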

5 / 5

Published heads recovered in top-10

+642%

ICL loss under induction-head ablation

python · pytorch · transformers · numpy

repo artifacts writeup

claude-evals v0.2

Safety eval suite with a calibrated judge

Sycophancy, refusal calibration (XSTest-style), jailbreak robustness. 52 hand-curated cases. v0.2 ships a hand-labeled gold set so you can measure the judge's accuracy before trusting any subject pass rate.
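
The calibration step amounts to scoring the judge against the gold labels before reading any subject-model number. A minimal sketch, assuming verdicts are stored as `case_id -> label` dictionaries (the data shape and the function name are hypothetical, not the repo's API):

```python
from collections import Counter


def calibrate(gold: dict[str, str], judged: dict[str, str]) -> tuple[float, Counter]:
    """Return judge accuracy and a (gold, judged) confusion count over shared cases."""
    shared = sorted(gold.keys() & judged.keys())
    pairs = [(gold[case], judged[case]) for case in shared]
    correct = sum(g == j for g, j in pairs)
    accuracy = correct / len(pairs) if pairs else 0.0
    return accuracy, Counter(pairs)
```

The confusion counts show which direction the judge errs in (for example, passing cases a human labelled as failures), which matters more for a safety eval than accuracy alone.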

52

Hand-curated eval cases

Gold verdicts for judge calibration

python · claude API · pydantic · messages.parse

repo artifact writeup

swe-agent-lite v0.2

A tiny SWE-bench, with failure modes

13 curated bug-fix tasks (3 multi-file), 5-tool agent surface, sandboxed workspace. Per-task failure-mode tags (hit_iteration_limit, edit_but_no_retest, exit_early) so you know why a task failed, not just that it did.
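
To make the tagging idea concrete, here is a small sketch that derives three of the tags from a failed run's sequence of tool calls. The tool names (`edit_file`, `run_tests`) and the rules themselves are assumptions for illustration; the repo's own rules may be defined differently.

```python
def tag_failure(steps: list[str], max_iterations: int) -> list[str]:
    """Assign failure-mode tags to a failed run, given its ordered tool names."""
    tags = []
    if len(steps) >= max_iterations:
        tags.append("hit_iteration_limit")
    edits = [i for i, step in enumerate(steps) if step == "edit_file"]
    tests = [i for i, step in enumerate(steps) if step == "run_tests"]
    # The last edit was never followed by a test run.
    if edits and (not tests or max(tests) < max(edits)):
        tags.append("edit_but_no_retest")
    # The agent stopped under budget without attempting any edit.
    if not edits and len(steps) < max_iterations:
        tags.append("exit_early")
    return tags
```

For example, `tag_failure(["read_file", "edit_file"], 10)` yields `["edit_but_no_retest"]`: the agent changed code and stopped without checking the result.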

13

Curated bug-fix tasks

3

Multi-file (incl. 1 hard)

python · tool-use · pytest · sandboxed

repo artifact writeup

prompt-gym v0.1

Regression tests for LLM prompts

YAML specs, four matchers (exact / contains / regex / llm_judge). Non-zero exit codes on failure and regression so it drops into CI as a gate.
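
A sketch of how the three deterministic matchers and the exit-code gate might fit together. The case format and function names are hypothetical, and `llm_judge` is left out because it needs a model call.

```python
import re

# Each matcher takes (expected, output) and returns True on a match.
MATCHERS = {
    "exact": lambda expected, output: output.strip() == expected.strip(),
    "contains": lambda expected, output: expected in output,
    "regex": lambda expected, output: re.search(expected, output) is not None,
}


def run_cases(cases: list[dict], outputs: dict[str, str]) -> int:
    """Check every case against the model output; return a process exit code."""
    failures = 0
    for case in cases:
        matched = MATCHERS[case["matcher"]](case["expected"], outputs[case["name"]])
        if not matched:
            failures += 1
            print(f"FAIL {case['name']} ({case['matcher']})")
    return 1 if failures else 0
```

In a CI job the return value would be passed to `sys.exit`, so any failing case fails the pipeline step.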

python · yaml · typer · ci-gate

repo

Retrieval & agent tooling

05 · v0.6+

personal-rag v0.7

Local RAG with hybrid retrieval & Contextual Retrieval

LanceDB + fastembed, hybrid BM25 / dense retrieval via RRF, Anthropic Contextual Retrieval, inline citations, watch-mode reindex, server-rendered web UI, and `similar` for finding related chunks.
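
Reciprocal rank fusion (RRF) itself is compact enough to show in full. This is the standard formulation with the conventional constant k = 60; whether the repo uses the same constant or tie-breaking rule is an assumption.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))
```

Because RRF uses only ranks, the BM25 and dense scores never need to be normalized onto a common scale.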

python · lancedb · fastembed · BM25 · fastapi

repo

mcp-zettel v0.6

MCP server for a personal Zettelkasten

[[wiki-link]] graph, keyword + semantic search, prompt templates, Mermaid diagrams, and hybrid link suggestions on note create via RRF fusion of keyword and semantic ranks.

python · MCP · fastembed · RRF

repo

claude-pr-reviewer v0.6

Inline PR review comments via `gh api`

CLI + GitHub Action. Per-file chunking for large diffs, .claude-review.yml repo config, diff-hash review cache, and a calibration harness that measures precision / recall against labelled finding sets.
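
Two of these pieces are easy to sketch: a cache key derived from the diff text, and precision / recall against a labelled finding set. Representing a finding as a `(file, line)` tuple and hashing with SHA-256 are assumptions for the example, not a description of the repo's internals.

```python
import hashlib

Finding = tuple[str, int]  # (file path, line number): a simplifying assumption


def diff_key(diff_text: str) -> str:
    """Stable cache key: an identical diff always maps to the same review entry."""
    return hashlib.sha256(diff_text.encode("utf-8")).hexdigest()


def precision_recall(predicted: set[Finding], labelled: set[Finding]) -> tuple[float, float]:
    """Precision and recall of predicted findings against the labelled set."""
    true_positives = len(predicted & labelled)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(labelled) if labelled else 0.0
    return precision, recall
```

Exact `(file, line)` matching is strict; a real harness might allow a small line tolerance so a comment one line off still counts.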

python · claude API · unidiff · github actions

repo

agent-interviewer v0.6

Mock interview CLI with four Claude personas

Behavioral, system-design, coding, case. Per-dimension scoring grounded in the transcript, YAML persona packs, replay through a different model, side-by-side diff between feedback variants in a small web viewer.

python · claude API · fastapi · pydantic

repo

paper-digest v0.6

arXiv / OpenReview / ACL → structured summary

Problem · method · results · limitations. Interactive follow-up Q&A grounded in the paper, reading-list batch mode, watch-a-folder auto-digest, and searchable history.

python · claude API · pypdf · watchdog

repo

Trading & MLOps

02 · v0.1

algo-trader v0.1

Paper-first algorithmic trading

Options wheel + mean-reversion strategies on ETFs. Walk-forward backtests with realistic slippage and costs, risk and compliance gates on every order, FastAPI + Next.js 15 control plane.
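
Walk-forward evaluation means fitting on one window, testing on the period immediately after it, then rolling both windows forward. A minimal index-splitting sketch, assuming fixed window sizes and a step equal to the test length (the function name and windowing choices are hypothetical):

```python
from collections.abc import Iterator


def walk_forward_splits(n_bars: int, train: int, test: int) -> Iterator[tuple[range, range]]:
    """Yield (train_indices, test_indices) pairs that roll forward through time."""
    start = 0
    while start + train + test <= n_bars:
        train_idx = range(start, start + train)
        test_idx = range(start + train, start + train + test)
        yield train_idx, test_idx
        start += test  # advance by one test window, so test periods never overlap
```

Every test window lies strictly after its training window, which keeps future data out of the parameters being evaluated.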

python · polars · fastapi · next.js · postgres

repo

mlops-customer-support v0.1

End-to-end ML pipeline for support analytics

PostgreSQL / MongoDB / Qdrant / Redis. HuggingFace sentiment / topic / NER. Prometheus + Grafana + Evidently for drift detection.

python · huggingface · prometheus · grafana · docker

repo

Projects

Earlier applied work.

Three self-directed builds predating the open-source portfolio above, each one shipped end-to-end with measured outcomes.

LLM TutorBot pilot

Personalized GenAI tutoring

Serverless tutoring assistant with domain fine-tuning and semantic-similarity routing (BERT). Pilot deployment measured user-side satisfaction and engagement.

95%

User satisfaction

2.4×

Engagement lift in pilots

huggingface · pytorch · AWS Lambda · BERT · CUDA

AI Social Pilot shipped

SMB content automation

React Native platform combining Supabase + n8n + Claude Vision. Extracts brand identity from uploaded assets and auto-generates social captions + visuals with a swipe-approval flow for non-technical founders.

90%

Faster content creation

5×

Posting efficiency

react native · supabase · n8n · claude vision

RAGWorks production

Enterprise document Q&A at 5M-doc scale

Retrieval-augmented QA service with caching and cross-encoder reranking. Benchmarked on a 5M-document corpus. The token-spend and latency wins came from the rerank + cache layer, not the base retrieval.

-58%

Token spend

+19%

NDCG@10

fastapi · faiss · pgvector · openai · cross-encoder

P95 latency: 180ms → 60ms
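
For reference, NDCG@10 can be computed as below. This sketch uses linear gain and takes the ideal ordering from the retrieved list itself, which is a simplifying assumption; a full evaluation would build the ideal ranking from every judged-relevant document for the query.

```python
import math


def dcg(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg(relevances: list[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same items."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

A reranker improves this number by moving relevant documents toward the top of the list, since each position is discounted by the log of its rank.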

Writing

Four essays on what I built.

The story behind the ICLR paper, and long-form writeups on the mech-interp reproduction, the calibrated safety-eval judge, and why pass rates aren't enough for coding-agent benchmarks. Writing is how I think; these are the technical decisions I'd want a reviewer to see.

ICLR 2026 · research · 10 min read

The story behind GIRA: why agent safety needs SRE-grade evaluation

The on-call agent I shipped at Amazon was the moment I realized "the model proposed a tool call" and "the system actually executes that tool call" need to be different events. Here's how that observation grew into a five-layer safety gate, an eval protocol that borrows from incident response, and a sole-author submission to ICLR 2026's Agents in the Wild workshop.

Read the essay OpenReview ↗

interpretability · 11 min read

Reproducing induction heads in GPT-2, and the bug I caught in my own v0.1

I shipped a prefix-matching score, then realized my formula was off by one. Here's the math derivation, how I caught it, why publishing the wrong result would have been worse than publishing nothing, and what the corrected run shows: 5/5 published heads recovered, +642% ICL-loss from targeted ablation.

Read the essay

safety evals · 9 min read

Why your LLM judge might be wrong, and how to measure it

"Claude judges Claude" is the default setup for safety evals and the most obvious critique. The fix isn't to abandon LLM judges; it's to ship a hand-labeled gold set, run the judge on it, and publish its accuracy alongside every subject-model number. Here's the design and its honest limits.

Read the essay

agents · 8 min read

Pass rates aren't enough: failure-mode tags for coding agents

When a coding agent fails a task, "63% pass rate" tells you nothing about why it failed. I built an 8-tag categorization (hit_iteration_limit, edit_but_no_retest, never_ran_tests, exit_early, and so on). Here's how it changed my priors on which failures were worth fixing first.

Read the essay

Community & leadership

Building developer ecosystems.

Scaling dev communities from hundreds to hundreds of thousands, organizing global hackathons, and mentoring the next wave of ML engineers.

40K+

Developer community

Built and scaled the Aspecta.ai developer ecosystem to 40,000+ active builders worldwide through 8+ hands-on workshops across GenAI and Web3.

$50K

Global hackathon

Launched flagship event with 1,100+ participants, 110+ project submissions, and a $50K prize pool; two features shipped back into the product within six months.

350+

Engineers served

Internal on-call agent I shipped via Amazon Bedrock Agents automates escalation workflows for 350+ Prime Video engineers.

Technical workshops

GenAI and Web3 deep-dives at Aspecta.ai with runbooks, partner enablement docs, and curriculum used by hundreds of developers.

GDG

Google Developer Groups

Organizer. Technical meetups and mentorship on ML systems, production deployment, and applied AI for early-career engineers.

MLH

Major League Hacking

Judge at national-level student hackathons. Evaluate technical depth, project execution, and presentation craft across hundreds of student teams.

About

How I actually work.

I'm an engineer and independent researcher working on LLM agent safety. The thread that ties the day job, the paper, and the open-source repos together is the same: tool-using agents are getting put on the path to real production action faster than the safety scaffolding around them is being built. I'm trying to close that gap with code, with measurement, and with writing.

Right now I'm AI Consultant at Amneal Pharmaceuticals, building Claude-Code-based research agents that pull pharmaceutical product data from external sources at orders of magnitude better throughput than the manual baseline, with the safety gates wrapped tight enough that an agent never writes anything to internal databases without a verified call site. Before that, I was SDE II at Amazon Prime Video CoreTech, where I shipped LLM systems for a 200M+ user catalog and a Bedrock Agents on-call assistant for 350+ engineers, the production exposure that became the seed for the ICLR paper.

The ICLR 2026 workshop paper (Agents in the Wild, sole author) proposes GIRA, a five-layer safety gate that separates LLM proposal from action authorization, paired with OEP, an SRE-grounded eval protocol that measures blast radius, injection success rate, and unauthorized action rate. The eleven open-source repos either feed into that (claude-evals, swe-agent-lite, mech-interp-starter) or surround it (the retrieval and agent tooling I keep finding myself wanting day-to-day).

I write code that admits its failure modes: every repo ships with committed run artifacts so you can read the headline claim before installing, and the mech-interp repo has an explicit regression guard for the off-by-one I caught in my own v0.1. Currently interviewing for research-engineer and frontier-engineer roles at AI labs. Based in New York, open to relocation. STEM OPT, H-1B cap-exempt eligible.

Based in: New York, NY
Current role: AI Consultant · Amneal Pharmaceuticals
Focus: LLM agent safety · Evals · Interpretability
Latest paper: ICLR 2026 · Agents in the Wild ↗
Visa: STEM OPT · H-1B sponsorship
Résumé: Drive · PDF ↗
Email: dhruv17062000@gmail.com
Phone: 908-476-3488
Elsewhere: GitHub · LinkedIn · Google Scholar

Education

Where I learned.

Rutgers University

New Brunswick, NJ

M.S. in Computer Science

GPA 3.9 / 4.0

Charusat University

Gujarat, India

B.S. in Computer Science & Engineering

GPA 3.7 / 4.0

Stack

What I reach for.

The tools I actually ship with, split between the day jobs (Amneal now, Amazon before) and the open-source portfolio.

Languages

Python 3.12 · TypeScript · Java · SQL

LLM / ML

Claude API · Bedrock Agents · messages.parse · Prompt caching · MCP · fastembed · HuggingFace · PyTorch · Polars

Cloud / Backend

AWS SageMaker · DynamoDB · S3 / EMR · FastAPI · Next.js · Docker · PostgreSQL · LanceDB · Qdrant · Redis · MongoDB

Ops & tooling

Airflow · GitHub Actions · Prometheus · Grafana · Evidently · uv · pytest · Typer · Rich

Contact

Let's make something verifiable.

Best path in: email. I reply within a day for role conversations; same for interesting open-source collaborations.

dhruv17062000@gmail.com · ICLR 2026 paper ↗ · Google Scholar ↗ · GitHub ↗ · LinkedIn ↗ · Résumé ↗
