Research Engineer, Model Evaluations

at Anthropic

📍 World
📍 New York City, United States
📍 San Francisco, United States

USD 320,000-485,000 per year

MIDDLE

✅ Hybrid

✅ Visa Sponsorship

Used Tools & Technologies

Machine Learning, LLM

Required Skills & Competences
Tag name is followed by "@" symbol and proficiency level value. About proficiency levels:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.

Python @ 6, Statistics @ 3, Distributed Systems @ 3, Leadership @ 3, Communication @ 3, Slack @ 3, Observability @ 3, AI @ 3, Data Visualization @ 3, Data Pipelines @ 3

Details

About Anthropic

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The team is a growing group of researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

Role overview

You will build evaluations that measure what Claude can actually do, turning ambiguous notions of "intelligence" into clear, defensible metrics. The work spans designing and implementing evaluations across capabilities and personality, and building the infrastructure to run those evaluations reliably at scale. You will partner closely with researchers through the full lifecycle of new capabilities, from defining what to measure to running evals against live training checkpoints and interpreting results.

Responsibilities

Design and run new evaluations of Claude's capabilities (reasoning, agentic behavior, knowledge, safety properties) and produce visualizations that make results legible to researchers and decision-makers
Build and harden the distributed eval execution platform so hundreds of evals run reliably against checkpoints during production RL training runs
Own dashboards researchers and leadership use to monitor model health during training; improve signal-to-noise, reduce latency, and make regressions obvious
Debug anomalous eval results mid-training-run, determine whether causes are model changes or infrastructure issues, and communicate clearly under time pressure
Improve tooling, libraries, and workflows researchers use to implement and iterate on evaluations
Partner with research teams across the full lifecycle of a new capability — from defining measurements to interpreting results as training progresses
Run experiments to characterize how prompting, sampling, and scaffolding choices affect results on internal and industry benchmarks
Communicate evaluations and their results to internal stakeholders and, where appropriate, external audiences

Minimum qualifications

Strong Python programming skills, including experience with production or research infrastructure
Experience building or operating distributed systems, data pipelines, or other infrastructure that needs to be reliable at scale
Clear written and verbal communication, especially when explaining technical results to non-specialists
Comfort operating in an on-call or production-support capacity when training runs are live
Care for the societal impacts of your work and an interest in steering powerful AI to be safe and beneficial

Preferred qualifications

Hands-on experience using large language models (e.g., Claude), including prompting, sampling, and scaffolding
Background in data visualization and a track record of building dashboards people trust and use
Experience developing robust evaluation metrics for language models
Experience with observability, monitoring, or experiment-tracking systems
Background in statistics and experimental design
Experience with large-scale dataset sourcing, curation, and processing
Experience running or supporting ML training infrastructure
Bias toward picking up slack and operating flexibly across team boundaries; enjoyment of pair programming

Representative projects

Stand up a new eval from scratch: define the task, build the dataset, implement scoring, validate against known signals, and ship a dashboard
Diagnose a mid-training regression and determine within hours whether it’s the model, harness, data, or infrastructure
Make a flaky distributed eval pipeline reliable: better retries, observability, and faster feedback
Partner with a research team to translate what "good" looks like into measurable artifacts

Compensation

Annual Salary: $320,000 - $485,000 USD

Logistics

Minimum education: Bachelor’s degree or equivalent experience
Location-based hybrid policy: staff are expected to be in one of the offices at least 25% of the time (some roles may require more)
Visa sponsorship: Anthropic states that it sponsors visas and retains an immigration lawyer to assist with the visa process

Benefits

Anthropic offers competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space for collaboration.

How we're different

Anthropic focuses on large-scale, high-impact AI research and values collaboration and strong communication. The team views AI research as an empirical science and hosts frequent research discussions to guide that work.