Open to research-engineer roles at frontier AI labs
ML Engineer · Amazon Prime Video · NYC

I build AI systems at scale — and publish the work.

SDE II at Amazon Prime Video. I architect LLM-powered systems processing petabyte-scale data across global regions — work that directly impacts 200M+ users at 99.9% uptime. On the side, I ship open-source research-adjacent tools — a mech-interp reproduction that recovers 5/5 of GPT-2's published induction heads, safety eval suites, agent benchmarks, and long-form writeups on each.

STEM OPT · H-1B sponsorship required · cap-exempt eligible
200M+
Users impacted at Amazon
99.9%
Multi-region uptime
5 / 5
GPT-2 induction heads recovered
11
Shipped open-source repos
Experience

Where I've shipped.

Five years across Amazon, a Series A startup, a manager role on a 9-person ML team, a government research org, and Indian Army defense R&D. Each role a step up in scope, complexity, and stakes.

Amazon

2025 — Present
Software Development Engineer II
Prime Video CoreTech · New York, NY
  • LLM-powered classification engine — Led development of the micro-genre system for Prime Video storefront. Architecture decisions and cross-team coordination with product, design, and ML platform to deliver personalized carousels for 200M+ global users.
  • Petabyte-scale ML pipelines — Architected real-time systems on AWS SageMaker, DynamoDB, S3, and EMR. 30% latency reduction and faster time-to-recommendation through batch optimization.
  • AI-powered internal agent — Spearheaded an on-call assistant built on Amazon Bedrock Agents. Automated escalation workflows for 350+ Prime Video engineers.
  • Multi-region reliability — Orchestrated cross-regional load testing across IAD, PDX, DUB, and ZAZ. Horizontal scaling to millions of requests/sec at 99.9% uptime.

Aspecta.ai

2024 — 2025
Founding Engineer (IC + PM)
Early-stage startup · Santa Clara, CA
  • 0 → 400K users — Directed cross-functional teams spanning engineering, product, marketing, and finance. Scaled the developer ecosystem 1000× within 12 months.
  • Global hackathon leadership — Launched flagship event with 1,100+ participants and $50K+ prize pool. 110+ project submissions; directly influenced 2 product features shipped within 6 months.
  • Enterprise partnerships — Established strategic relationships with Google, Amazon, Microsoft and leading AI startups. 40% increase in developer engagement through co-developed initiatives.
  • Technical workshops — Built and facilitated 8+ workshops on GenAI and Web3. Produced comprehensive runbooks and partner enablement materials.

NBVP Technologies

2019 — 2022
Machine Learning Engineer
Gujarat, India
  • Real-time video violence detection — Designed and deployed a CNN + LSTM system in PyTorch. Processed 10,000+ frames/minute at 92.3% accuracy, improving surveillance alert efficiency by 40%.
  • Scalable vision pipelines — Built image preprocessing on Airflow + PostgreSQL, handling 50,000+ samples for an emotion recognition model at 91% accuracy.
  • Production interview portal — Full-stack React + Flask deployment for internal recruiting. 25% faster page loads, 15% end-user satisfaction lift, 85% adoption in the first month across 200+ users.
  • Linux-deployable segmentation — TensorFlow + OpenCV image segmentation supporting 20+ defense R&D initiatives with on-device inference.
  • ML classification in client-server tooling — Built and validated models inside a tactical AI simulation stack, reducing manual analysis cycles by 30%.
Open source

Eleven repos, actually shipped.

Weekends and evenings, in public. Every repo has CI, tests, and committed run artifacts you can read before installing. Grouped by what signal they send for frontier-lab hiring.

Safety, evals & interpretability

04 · v0.2
mech-interp-starter v0.2

Reproducing Olsson et al. (Anthropic, 2022)

Prefix-matching + copying score + head ablation on GPT-2 small. NumPy-only scoring math with an explicit off-by-one regression guard. Caught and transparently logged a bug in my own v0.1 instead of shipping the wrong result.

5 / 5
Published heads recovered in top-10
+642%
ICL loss under induction-head ablation
python · pytorch · transformers · numpy
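The repo's prefix-matching score isn't reproduced here, but the core idea fits in a few lines. A minimal NumPy sketch, assuming a repeated random-token sequence and one head's attention matrix (the function name and exact indexing details are illustrative, not the repo's API):

```python
import numpy as np

def prefix_matching_score(attn: np.ndarray, seq_len: int) -> float:
    """Average attention from each position in the second copy of a
    repeated random sequence to the token *after* its first occurrence.

    attn: (2*seq_len, 2*seq_len) attention pattern for one head, run on
          a sequence [t_0..t_{n-1}, t_0..t_{n-1}] of random tokens.
    """
    # Query position i in the second copy repeats token t_{i-seq_len};
    # an induction head should attend to position i - seq_len + 1, the
    # token that followed that first occurrence.
    queries = np.arange(seq_len, 2 * seq_len)
    keys = queries - seq_len + 1
    return float(attn[queries, keys].mean())
```

The `+ 1` in the key index is a natural home for exactly the kind of off-by-one the writeup describes: the head attends to the token after the first occurrence, not the occurrence itself.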
claude-evals v0.2

Safety eval suite with a calibrated judge

Sycophancy, refusal calibration (XSTest-style), jailbreak robustness. 52 hand-curated cases. v0.2 ships a hand-labeled gold set so you can measure the judge's accuracy before trusting any subject pass rate.

52
Hand-curated eval cases
31
Gold verdicts for judge calibration
python · claude API · pydantic · messages.parse
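Judge calibration reduces to a small piece of bookkeeping. A sketch of the idea, assuming verdicts are pass/fail labels keyed by case id (`judge_accuracy` is a hypothetical helper, not the suite's actual interface):

```python
def judge_accuracy(gold: dict[str, str], judged: dict[str, str]) -> dict:
    """Compare judge verdicts against hand-labeled gold verdicts.

    gold / judged: case_id -> verdict ("pass" | "fail").
    Returns overall accuracy plus per-verdict [agree, total] counts, so
    you can see whether the judge is biased toward one verdict before
    trusting any subject-model pass rate.
    """
    shared = gold.keys() & judged.keys()
    agree = sum(gold[c] == judged[c] for c in shared)
    per_verdict: dict[str, list[int]] = {}
    for c in shared:
        counts = per_verdict.setdefault(gold[c], [0, 0])
        counts[1] += 1
        if gold[c] == judged[c]:
            counts[0] += 1
    return {"accuracy": agree / len(shared), "per_verdict": per_verdict}
```

The per-verdict split is the useful part: a judge that agrees on 90% of passes but 40% of fails will quietly inflate every subject pass rate.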
swe-agent-lite v0.2

A tiny SWE-bench, with failure modes

13 curated bug-fix tasks (3 multi-file), 5-tool agent surface, sandboxed workspace. Per-task failure-mode tags (hit_iteration_limit, edit_but_no_retest, exit_early) so you know why a task failed, not just that it did.

13
Curated bug-fix tasks
3
Multi-file (incl. 1 hard)
python · tool-use · pytest · sandboxed
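The tagging step can be sketched as rules over run-level signals. The tag names follow the repo's labels, but the input booleans here are illustrative, not the repo's actual transcript schema:

```python
def tag_failure(ran_tests: bool, edited_files: bool,
                iterations: int, max_iters: int,
                tests_rerun_after_edit: bool) -> list[str]:
    """Assign failure-mode tags to a failed agent run."""
    tags = []
    if iterations >= max_iters:
        # Agent burned its whole budget without a passing run.
        tags.append("hit_iteration_limit")
    if not ran_tests:
        tags.append("never_ran_tests")
    if edited_files and ran_tests and not tests_rerun_after_edit:
        # Made an edit, then never verified it.
        tags.append("edit_but_no_retest")
    if iterations < max_iters and not edited_files:
        # Gave up with budget remaining and nothing changed.
        tags.append("exit_early")
    return tags
```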
prompt-gym v0.1

Regression tests for LLM prompts

YAML specs, four matchers (exact / contains / regex / llm_judge). Non-zero exit codes on failure and regression so it drops into CI as a gate.

python · yaml · typer · ci-gate
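Three of the four matchers are plain string or regex checks; only llm_judge needs a model call. A sketch under that assumption (function names are illustrative, not the repo's API):

```python
import re

def run_matcher(kind: str, expected: str, output: str) -> bool:
    """Evaluate one matcher against model output.
    'llm_judge' is omitted here since it requires an API call."""
    if kind == "exact":
        return output == expected
    if kind == "contains":
        return expected in output
    if kind == "regex":
        return re.search(expected, output) is not None
    raise ValueError(f"unknown matcher: {kind}")

def gate(results: list[bool]) -> int:
    """CI gate: exit code 0 only when every check passes."""
    return 0 if all(results) else 1
```

Returning a non-zero exit code is what lets a plain `prompt-gym run` line in a workflow file fail the build.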

Retrieval & agent tooling

05 · v0.6+
personal-rag v0.7

Local RAG with hybrid retrieval & Contextual Retrieval

LanceDB + fastembed, hybrid BM25 / dense retrieval via RRF, Anthropic Contextual Retrieval, inline citations, watch-mode reindex, server-rendered web UI, and a similar-chunks lookup for finding related content.

python · lancedb · fastembed · BM25 · fastapi
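RRF itself is tiny. A self-contained sketch of the fusion step, assuming each retriever returns a ranked list of document ids (k = 60 is the conventional constant from the original RRF paper; the repo's parameters may differ):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several ranked lists of doc ids.

    score(d) = sum over lists of 1 / (k + rank_d), rank starting at 1.
    Documents near the top of *any* list float up; k damps the
    difference between rank 1 and rank 2.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only consumes ranks, it sidesteps the score-normalization problem of mixing BM25 scores with cosine similarities.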
mcp-zettel v0.6

MCP server for a personal Zettelkasten

[[wiki-link]] graph, keyword + semantic search, prompt templates, Mermaid diagrams, and hybrid link suggestions on note create via RRF fusion of keyword and semantic ranks.

python · MCP · fastembed · RRF
claude-pr-reviewer v0.6

Inline PR review comments via gh api

CLI + GitHub Action. Per-file chunking for large diffs, .claude-review.yml repo config, diff-hash review cache, and a calibration harness that measures precision / recall against labelled finding sets.

python · claude API · unidiff · github actions
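A diff-hash cache is straightforward to sketch. The key should cover anything that could change the review output, not just the diff text (field names here are illustrative, not the repo's actual schema):

```python
import hashlib

def diff_cache_key(diff: str, model: str, config: str) -> str:
    """Cache key for a PR review run.

    Hashes the diff plus the model name and repo review config, so a
    re-run on an unchanged PR is a cache hit instead of an API call,
    while changing the model or .claude-review.yml invalidates it.
    """
    h = hashlib.sha256()
    for part in (diff, model, config):
        h.update(part.encode())
        h.update(b"\x00")  # separator so adjacent fields can't blur together
    return h.hexdigest()
```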
agent-interviewer v0.6

Mock interview CLI with four Claude personas

Behavioral, system-design, coding, case. Per-dimension scoring grounded in the transcript, YAML persona packs, replay through a different model, side-by-side diff between feedback variants in a small web viewer.

python · claude API · fastapi · pydantic
paper-digest v0.6

arXiv / OpenReview / ACL → structured summary

Problem · method · results · limitations. Interactive follow-up Q&A grounded in the paper, reading-list batch mode, watch-a-folder auto-digest, searchable history.

python · claude API · pypdf · watchdog

Trading & MLOps

02 · v0.1
algo-trader v0.1

Paper-first algorithmic trading

Options wheel + mean-reversion strategies on ETFs. Walk-forward backtests with realistic slippage and costs, risk and compliance gates on every order, FastAPI + Next.js 15 control plane.

python · polars · fastapi · next.js · postgres
mlops-customer-support v0.1

End-to-end ML pipeline for support analytics

PostgreSQL / MongoDB / Qdrant / Redis. HuggingFace sentiment / topic / NER. Prometheus + Grafana + Evidently for drift detection.

python · huggingface · prometheus · grafana · docker
Projects

Earlier applied work.

Three self-directed builds predating the open-source portfolio above — each one shipped end-to-end with measured outcomes.

LLM TutorBot pilot

Personalized GenAI tutoring

Serverless tutoring assistant with domain fine-tuning and semantic-similarity routing (BERT). Pilot deployment measured user-side satisfaction and engagement.

95%
User satisfaction
2.4×
Engagement lift in pilots
huggingface · pytorch · AWS Lambda · BERT · CUDA
AI Social Pilot shipped

SMB content automation

React Native platform combining Supabase + n8n + Claude Vision. Extracts brand identity from uploaded assets and auto-generates social captions + visuals with a swipe-approval flow for non-technical founders.

90%
Faster content creation
Posting efficiency
react native · supabase · n8n · claude vision
RAGWorks production

Enterprise document Q&A at 5M-doc scale

Retrieval-augmented QA service with caching and cross-encoder reranking. Benchmarked on a 5M-document corpus. The token-spend and latency wins came from the rerank + cache layer, not the base retrieval.

-58%
Token spend
+19%
NDCG@10
fastapi · faiss · pgvector · openai · cross-encoder
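The rerank layer can be sketched independently of any model. A minimal version, assuming the cross-encoder is abstracted as a pairwise scoring function (for example sentence-transformers' CrossEncoder.predict; all names here are illustrative):

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 5) -> list[str]:
    """Second-stage rerank: score each (query, passage) pair and keep
    the best top_k. score_fn stands in for the cross-encoder forward
    pass, which sees query and passage jointly, unlike the bi-encoder
    used for first-stage retrieval."""
    ranked = sorted(candidates,
                    key=lambda p: score_fn(query, p),
                    reverse=True)
    return ranked[:top_k]
```

The token-spend win comes from the same shape: retrieve wide and cheap, rerank narrow, and only the reranked top_k ever reaches the LLM context.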
Writing

Three essays on what I built.

Long-form writeups on the mechanistic interp reproduction, building a credible safety-eval judge, and why pass rates aren't enough for coding-agent benchmarks. Writing is how I think; these are the technical decisions I'd want a reviewer to see.

interpretability · 11 min read

Reproducing induction heads in GPT-2 — and the bug I caught in my own v0.1

I shipped a prefix-matching score, then realized my formula was off by one. Here's the math derivation, how I caught it, why publishing the wrong result would have been worse than publishing nothing, and what the corrected run shows: 5/5 published heads recovered, +642% ICL loss from targeted ablation.

Read the essay
safety evals · 9 min read

Why your LLM judge might be wrong — and how to measure it

"Claude judges Claude" is the default setup for safety evals and the most obvious critique. The fix isn't to abandon LLM judges; it's to ship a hand-labeled gold set, run the judge on it, and publish its accuracy alongside every subject-model number. Here's the design and honest limits.

Read the essay
agents · 8 min read

Pass rates aren't enough: failure-mode tags for coding agents

When a coding agent fails a task, "63% pass rate" tells you nothing about why it failed. I built an 8-tag categorization — hit_iteration_limit, edit_but_no_retest, never_ran_tests, exit_early, and so on — and explain how it changed my priors on which failures were worth fixing first.

Read the essay
Community & leadership

Building developer ecosystems.

Scaling dev communities from hundreds to hundreds of thousands, organizing global hackathons, and mentoring the next wave of ML engineers.

40K+

Developer community

Built and scaled the Aspecta.ai developer ecosystem to 40,000+ active builders worldwide through 8+ hands-on workshops across GenAI and Web3.

$50K

Global hackathon

Launched flagship event with 1,100+ participants, 110+ project submissions, and a $50K prize pool — two features shipped back into the product within six months.

350+

Engineers served

The internal on-call agent I shipped on Amazon Bedrock Agents automates escalation workflows for 350+ Prime Video engineers.

8+

Technical workshops

GenAI and Web3 deep-dives at Aspecta.ai with runbooks, partner enablement docs, and curriculum used by hundreds of developers.

GDG

Google Developer Groups

Organizer. Technical meetups and mentorship on ML systems, production deployment, and applied AI for early-career engineers.

MLH

Major League Hacking

Judge at national-level student hackathons. Evaluate technical depth, project execution, and presentation craft across hundreds of student teams.

About

How I actually work.

I'm an ML engineer at Amazon Prime Video building LLM-powered systems that serve hundreds of millions of users at multi-region scale. The day job is production infrastructure — pipelines, agents, reliability engineering — on systems where a single regression ripples out to tens of millions of streams.

Outside of Amazon, I spend my evenings on open-source research-adjacent work: reproducing Anthropic's induction-heads result on GPT-2, building safety eval suites with calibrated judges, measuring coding agents with real failure-mode analysis. I try to write code that admits its failure modes. Every repo ships with committed run artifacts so you can see the headline claim before installing. The mech-interp repo has an explicit regression guard for the off-by-one I caught in my own v0.1 — if someone ever reverts the fix, it fails loudly.

Currently interviewing for research-engineer and frontier-engineer roles at AI labs. Based in New York, open to relocation. STEM OPT, H-1B cap-exempt eligible.

Education

Where I learned.

Rutgers University

New Brunswick, NJ
M.S. in Computer Science
GPA 3.9 / 4.0

Charusat University

Gujarat, India
B.S. in Computer Science & Engineering
GPA 3.7 / 4.0
Stack

What I reach for.

The tools I actually ship with, split between the day job at Amazon and the open-source portfolio.

Languages
Python 3.12 · TypeScript · Java · SQL
LLM / ML
Claude API · Bedrock Agents · messages.parse · Prompt caching · MCP · fastembed · HuggingFace · PyTorch · Polars
Cloud / Backend
AWS SageMaker · DynamoDB · S3 / EMR · FastAPI · Next.js · Docker · PostgreSQL · LanceDB · Qdrant · Redis · MongoDB
Ops & tooling
Airflow · GitHub Actions · Prometheus · Grafana · Evidently · uv · pytest · Typer · Rich
Contact

Let's make something verifiable.

Best path in: email. I reply within a day for role conversations; same for interesting open-source collaborations.