Open to research-engineer roles at frontier AI labs
ML Engineer · Amazon Prime Video · NYC

I build AI systems at scale — and publish the work.

SDE II at Amazon Prime Video. I architect LLM-powered systems processing petabyte-scale data across global regions — work that directly impacts 200M+ users at 99.9% uptime. On the side, I ship open-source research-adjacent tools — a mech-interp reproduction that recovers 5/5 of GPT-2's published induction heads, safety eval suites, agent benchmarks, and long-form writeups on each.

STEM OPT · H-1B sponsorship required · cap-exempt eligible
200M+
Users impacted at Amazon
99.9%
Multi-region uptime
5 / 5
GPT-2 induction heads recovered
11
Shipped open-source repos
Experience

Where I've shipped.

Five years across Amazon, a Series A startup, a manager role on a 9-person ML team, a government research org, and Indian Army defense R&D. Each role a step up in scope, complexity, and stakes.

Amazon

2025 — Present
Software Development Engineer II
Prime Video CoreTech · New York, NY
  • LLM-powered classification engine — Led development of the micro-genre system for Prime Video storefront. Architecture decisions and cross-team coordination with product, design, and ML platform to deliver personalized carousels for 200M+ global users.
  • Petabyte-scale ML pipelines — Architected real-time systems on AWS SageMaker, DynamoDB, S3, and EMR. 30% latency reduction and faster time-to-recommendation through batch optimization.
  • AI-powered internal agent — Spearheaded an on-call assistant built on Amazon Bedrock Agents. Automated escalation workflows for 350+ Prime Video engineers.
  • Multi-region reliability — Orchestrated cross-regional load testing across IAD, PDX, DUB, and ZAZ. Horizontal scaling to millions of requests/sec at 99.9% uptime.

Aspecta.ai

2024 — 2025
Founding Engineer (IC + PM)
Early-stage startup · Santa Clara, CA
  • 0 → 400K users — Directed cross-functional teams spanning engineering, product, marketing, and finance. Scaled the developer ecosystem 1000× within 12 months.
  • Global hackathon leadership — Launched flagship event with 1,100+ participants and $50K+ prize pool. 110+ project submissions; directly influenced 2 product features shipped within 6 months.
  • Enterprise partnerships — Established strategic relationships with Google, Amazon, Microsoft and leading AI startups. 40% increase in developer engagement through co-developed initiatives.
  • Technical workshops — Built and facilitated 8+ workshops on GenAI and Web3. Produced comprehensive runbooks and partner enablement materials.

NBVP Technologies

2019 — 2022
Machine Learning Engineer
Gujarat, India
  • Real-time video violence detection — Designed and deployed a CNN + LSTM system in PyTorch. Processed 10,000+ frames/minute at 92.3% accuracy, improving surveillance alert efficiency by 40%.
  • Scalable vision pipelines — Built image preprocessing on Airflow + PostgreSQL, handling 50,000+ samples for an emotion recognition model at 91% accuracy.
  • Production interview portal — Full-stack React + Flask deployment for internal recruiting. 25% faster page loads, 15% end-user satisfaction lift, 85% adoption in the first month across 200+ users.
  • Linux-deployable segmentation — TensorFlow + OpenCV image segmentation supporting 20+ defense R&D initiatives with on-device inference.
  • ML classification in client-server tooling — Built and validated models inside a tactical AI simulation stack, reducing manual analysis cycles by 30%.
Open source

Eleven repos, actually shipped.

Weekends and evenings, in public. Every repo has CI, tests, and committed run artifacts you can read before installing. Grouped by what signal they send for frontier-lab hiring.

Safety, evals & interpretability

04 · v0.2
mech-interp-starter v0.2

Reproducing Olsson et al. (Anthropic, 2022)

Prefix-matching + copying score + head ablation on GPT-2 small. NumPy-only scoring math with an explicit off-by-one regression guard. Caught and transparently logged a bug in my own v0.1 instead of shipping the wrong result.

5 / 5
Published heads recovered in top-10
+642%
ICL loss under induction-head ablation
python · pytorch · transformers · numpy
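The repo's prefix-matching score isn't reproduced here, but the core idea fits in a few lines. A minimal NumPy sketch, assuming a repeated random-token sequence and one head's attention matrix (the function name and exact indexing details are illustrative, not the repo's API):

```python
import numpy as np

def prefix_matching_score(attn: np.ndarray, seq_len: int) -> float:
    """Average attention from each position in the second copy of a
    repeated random sequence to the token *after* its first occurrence.

    attn: (2*seq_len, 2*seq_len) attention pattern for one head, run on
          a sequence [t_0..t_{n-1}, t_0..t_{n-1}] of random tokens.
    """
    # Query position i in the second copy repeats token t_{i-seq_len};
    # an induction head should attend to position i - seq_len + 1, the
    # token that followed that first occurrence.
    queries = np.arange(seq_len, 2 * seq_len)
    keys = queries - seq_len + 1
    return float(attn[queries, keys].mean())
```

The `+ 1` in the key index is a natural home for exactly the kind of off-by-one the writeup describes: the head attends to the token after the first occurrence, not the occurrence itself.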
claude-evals v0.2

Safety eval suite with a calibrated judge

Sycophancy, refusal calibration (XSTest-style), jailbreak robustness. 52 hand-curated cases. v0.2 ships a hand-labeled gold set so you can measure the judge's accuracy before trusting any subject pass rate.

52
Hand-curated eval cases
31
Gold verdicts for judge calibration
python · claude API · pydantic · messages.parse
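Judge calibration reduces to a small piece of bookkeeping. A sketch of the idea, assuming verdicts are pass/fail labels keyed by case id (`judge_accuracy` is a hypothetical helper, not the suite's actual interface):

```python
def judge_accuracy(gold: dict[str, str], judged: dict[str, str]) -> dict:
    """Compare judge verdicts against hand-labeled gold verdicts.

    gold / judged: case_id -> verdict ("pass" | "fail").
    Returns overall accuracy plus per-verdict [agree, total] counts, so
    you can see whether the judge is biased toward one verdict before
    trusting any subject-model pass rate.
    """
    shared = gold.keys() & judged.keys()
    agree = sum(gold[c] == judged[c] for c in shared)
    per_verdict: dict[str, list[int]] = {}
    for c in shared:
        counts = per_verdict.setdefault(gold[c], [0, 0])
        counts[1] += 1
        if gold[c] == judged[c]:
            counts[0] += 1
    return {"accuracy": agree / len(shared), "per_verdict": per_verdict}
```

The per-verdict split is the useful part: a judge that agrees on 90% of passes but 40% of fails will quietly inflate every subject pass rate.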
swe-agent-lite v0.2

A tiny SWE-bench, with failure modes

13 curated bug-fix tasks (3 multi-file), 5-tool agent surface, sandboxed workspace. Per-task failure-mode tags (hit_iteration_limit, edit_but_no_retest, exit_early) so you know why a task failed, not just that it did.

13
Curated bug-fix tasks
3
Multi-file (incl. 1 hard)
python · tool-use · pytest · sandboxed
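The tagging step can be sketched as rules over run-level signals. The tag names follow the repo's labels, but the input booleans here are illustrative, not the repo's actual transcript schema:

```python
def tag_failure(ran_tests: bool, edited_files: bool,
                iterations: int, max_iters: int,
                tests_rerun_after_edit: bool) -> list[str]:
    """Assign failure-mode tags to a failed agent run."""
    tags = []
    if iterations >= max_iters:
        # Agent burned its whole budget without a passing run.
        tags.append("hit_iteration_limit")
    if not ran_tests:
        tags.append("never_ran_tests")
    if edited_files and ran_tests and not tests_rerun_after_edit:
        # Made an edit, then never verified it.
        tags.append("edit_but_no_retest")
    if iterations < max_iters and not edited_files:
        # Gave up with budget remaining and nothing changed.
        tags.append("exit_early")
    return tags
```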
prompt-gym v0.1

Regression tests for LLM prompts

YAML specs, four matchers (exact / contains / regex / llm_judge). Non-zero exit codes on failure and regression so it drops into CI as a gate.

python · yaml · typer · ci-gate
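Three of the four matchers are plain string or regex checks; only llm_judge needs a model call. A sketch under that assumption (function names are illustrative, not the repo's API):

```python
import re

def run_matcher(kind: str, expected: str, output: str) -> bool:
    """Evaluate one matcher against model output.
    'llm_judge' is omitted here since it requires an API call."""
    if kind == "exact":
        return output == expected
    if kind == "contains":
        return expected in output
    if kind == "regex":
        return re.search(expected, output) is not None
    raise ValueError(f"unknown matcher: {kind}")

def gate(results: list[bool]) -> int:
    """CI gate: exit code 0 only when every check passes."""
    return 0 if all(results) else 1
```

Returning a non-zero exit code is what lets a plain `prompt-gym run` line in a workflow file fail the build.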

Retrieval & agent tooling

05 · v0.6+
personal-rag v0.7

Local RAG with hybrid retrieval & Contextual Retrieval

LanceDB + fastembed, hybrid BM25 / dense retrieval via RRF, Anthropic Contextual Retrieval, inline citations, watch-mode reindex, server-rendered web UI, and a similar-chunks lookup for finding related content.

python · lancedb · fastembed · BM25 · fastapi
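RRF itself is tiny. A self-contained sketch of the fusion step, assuming each retriever returns a ranked list of document ids (k = 60 is the conventional constant from the original RRF paper; the repo's parameters may differ):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several ranked lists of doc ids.

    score(d) = sum over lists of 1 / (k + rank_d), rank starting at 1.
    Documents near the top of *any* list float up; k damps the
    difference between rank 1 and rank 2.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only consumes ranks, it sidesteps the score-normalization problem of mixing BM25 scores with cosine similarities.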
mcp-zettel v0.6

MCP server for a personal Zettelkasten

[[wiki-link]] graph, keyword + semantic search, prompt templates, Mermaid diagrams, and hybrid link suggestions on note create via RRF fusion of keyword and semantic ranks.

python · MCP · fastembed · RRF
claude-pr-reviewer v0.6

Inline PR review comments via gh api

CLI + GitHub Action. Per-file chunking for large diffs, .claude-review.yml repo config, diff-hash review cache, and a calibration harness that measures precision / recall against labelled finding sets.

python · claude API · unidiff · github actions
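A diff-hash cache is straightforward to sketch. The key should cover anything that could change the review output, not just the diff text (field names here are illustrative, not the repo's actual schema):

```python
import hashlib

def diff_cache_key(diff: str, model: str, config: str) -> str:
    """Cache key for a PR review run.

    Hashes the diff plus the model name and repo review config, so a
    re-run on an unchanged PR is a cache hit instead of an API call,
    while changing the model or .claude-review.yml invalidates it.
    """
    h = hashlib.sha256()
    for part in (diff, model, config):
        h.update(part.encode())
        h.update(b"\x00")  # separator so adjacent fields can't blur together
    return h.hexdigest()
```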
agent-interviewer v0.6

Mock interview CLI with four Claude personas

Behavioral, system-design, coding, case. Per-dimension scoring grounded in the transcript, YAML persona packs, replay through a different model, side-by-side diff between feedback variants in a small web viewer.

python · claude API · fastapi · pydantic
paper-digest v0.6

arXiv / OpenReview / ACL → structured summary

Problem · method · results · limitations. Interactive follow-up Q&A grounded in the paper, reading-list batch mode, watch-a-folder auto-digest, searchable history.

python · claude API · pypdf · watchdog

Trading & MLOps

02 · v0.1
algo-trader v0.1

Paper-first algorithmic trading

Options wheel + mean-reversion strategies on ETFs. Walk-forward backtests with realistic slippage and costs, risk and compliance gates on every order, FastAPI + Next.js 15 control plane.

python · polars · fastapi · next.js · postgres
mlops-customer-support v0.1

End-to-end ML pipeline for support analytics

PostgreSQL / MongoDB / Qdrant / Redis. HuggingFace sentiment / topic / NER. Prometheus + Grafana + Evidently for drift detection.

python · huggingface · prometheus · grafana · docker
Projects

Earlier applied work.

Three self-directed builds predating the open-source portfolio above — each one shipped end-to-end with measured outcomes.

LLM TutorBot pilot

Personalized GenAI tutoring

Serverless tutoring assistant with domain fine-tuning and semantic-similarity routing (BERT). Pilot deployment measured user-side satisfaction and engagement.

95%
User satisfaction
2.4×
Engagement lift in pilots
huggingface · pytorch · AWS Lambda · BERT · CUDA
AI Social Pilot shipped

SMB content automation

React Native platform combining Supabase + n8n + Claude Vision. Extracts brand identity from uploaded assets and auto-generates social captions + visuals with a swipe-approval flow for non-technical founders.

90%
Faster content creation
Posting efficiency
react native · supabase · n8n · claude vision
RAGWorks production

Enterprise document Q&A at 5M-doc scale

Retrieval-augmented QA service with caching and cross-encoder reranking. Benchmarked on a 5M-document corpus. The token-spend and latency wins came from the rerank + cache layer, not the base retrieval.

-58%
Token spend
+19%
NDCG@10
fastapi · faiss · pgvector · openai · cross-encoder
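The rerank layer can be sketched independently of any model. A minimal version, assuming the cross-encoder is abstracted as a pairwise scoring function (for example sentence-transformers' CrossEncoder.predict; all names here are illustrative):

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 5) -> list[str]:
    """Second-stage rerank: score each (query, passage) pair and keep
    the best top_k. score_fn stands in for the cross-encoder forward
    pass, which sees query and passage jointly, unlike the bi-encoder
    used for first-stage retrieval."""
    ranked = sorted(candidates,
                    key=lambda p: score_fn(query, p),
                    reverse=True)
    return ranked[:top_k]
```

The token-spend win comes from the same shape: retrieve wide and cheap, rerank narrow, and only the reranked top_k ever reaches the LLM context.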
Writing

Three essays on what I built.

Long-form writeups on the mechanistic interp reproduction, building a credible safety-eval judge, and why pass rates aren't enough for coding-agent benchmarks. Writing is how I think; these are the technical decisions I'd want a reviewer to see.

interpretability · 11 min read

Reproducing induction heads in GPT-2 — and the bug I caught in my own v0.1

I shipped a prefix-matching score, then realized my formula was off by one. Here's the math derivation, how I caught it, why publishing the wrong result would have been worse than publishing nothing, and what the corrected run shows: 5/5 published heads recovered, +642% ICL loss from targeted ablation.

Read the essay
safety evals · 9 min read

Why your LLM judge might be wrong — and how to measure it

"Claude judges Claude" is the default setup for safety evals and the most obvious critique. The fix isn't to abandon LLM judges; it's to ship a hand-labeled gold set, run the judge on it, and publish its accuracy alongside every subject-model number. Here's the design and honest limits.

Read the essay
agents · 8 min read

Pass rates aren't enough: failure-mode tags for coding agents

When a coding agent fails a task, "63% pass rate" tells you nothing about why it failed. I built an 8-tag categorization — hit_iteration_limit, edit_but_no_retest, never_ran_tests, exit_early, and so on — and explain how it changed my priors on which failures were worth fixing first.

Read the essay
Community & leadership

Building developer ecosystems.

Scaling dev communities from hundreds to hundreds of thousands, organizing global hackathons, and mentoring the next wave of ML engineers.

40K+

Developer community

Built and scaled the Aspecta.ai developer ecosystem to 40,000+ active builders worldwide through 8+ hands-on workshops across GenAI and Web3.

$50K

Global hackathon

Launched flagship event with 1,100+ participants, 110+ project submissions, and a $50K prize pool — two features shipped back into the product within six months.

350+

Engineers served

The internal on-call agent I shipped on Amazon Bedrock Agents automates escalation workflows for 350+ Prime Video engineers.

8+

Technical workshops

GenAI and Web3 deep-dives at Aspecta.ai with runbooks, partner enablement docs, and curriculum used by hundreds of developers.

GDG

Google Developer Groups

Organizer. Technical meetups and mentorship on ML systems, production deployment, and applied AI for early-career engineers.

MLH

Major League Hacking

Judge at national-level student hackathons. Evaluate technical depth, project execution, and presentation craft across hundreds of student teams.

About

How I actually work.

I'm an ML engineer at Amazon Prime Video building LLM-powered systems that serve hundreds of millions of users at multi-region scale. The day job is production infrastructure — pipelines, agents, reliability engineering — on systems where a single regression ripples out to tens of millions of streams.

Outside of Amazon, I spend my evenings on open-source research-adjacent work: reproducing Anthropic's induction-heads result on GPT-2, building safety eval suites with calibrated judges, measuring coding agents with real failure-mode analysis. I try to write code that admits its failure modes. Every repo ships with committed run artifacts so you can see the headline claim before installing. The mech-interp repo has an explicit regression guard for the off-by-one I caught in my own v0.1 — if someone ever reverts the fix, it fails loudly.

Currently interviewing for research-engineer and frontier-engineer roles at AI labs. Based in New York, open to relocation. STEM OPT, H-1B cap-exempt eligible.

Education

Where I learned.

Rutgers University

New Brunswick, NJ
M.S. in Computer Science
GPA 3.9 / 4.0

Charusat University

Gujarat, India
B.S. in Computer Science & Engineering
GPA 3.7 / 4.0
Stack

What I reach for.

The tools I actually ship with, split between the day job at Amazon and the open-source portfolio.

Languages
Python 3.12 · TypeScript · Java · SQL
LLM / ML
Claude API · Bedrock Agents · messages.parse · Prompt caching · MCP · fastembed · HuggingFace · PyTorch · Polars
Cloud / Backend
AWS SageMaker · DynamoDB · S3 / EMR · FastAPI · Next.js · Docker · PostgreSQL · LanceDB · Qdrant · Redis · MongoDB
Ops & tooling
Airflow · GitHub Actions · Prometheus · Grafana · Evidently · uv · pytest · Typer · Rich
Contact

Let's make something verifiable.

Best path in: email. I reply within a day for role conversations; same for interesting open-source collaborations.