Research Engineer, Model Evaluations

USD 320,000-485,000 per year
MIDDLE
✅ Hybrid
✅ Visa Sponsorship

Used Tools & Technologies

Machine Learning, LLM

Required Skills & Competences

Python @ 6, Statistics @ 3, Distributed Systems @ 3, Leadership @ 3, Communication @ 3, Slack @ 3, Observability @ 3, AI @ 3, Data Visualization @ 3, Data Pipelines @ 3

Details

About Anthropic

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The team is a growing group of researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

Role overview

You will build evaluations that measure what Claude can actually do, turning ambiguous notions of "intelligence" into clear, defensible metrics. The work spans designing and implementing evaluations across capabilities and personality, and building the infrastructure to run those evaluations reliably at scale. You will partner closely with researchers throughout the lifecycle of new capabilities, from defining what to measure to running evals against live training checkpoints and interpreting results.

Responsibilities

  • Design and run new evaluations of Claude's capabilities (reasoning, agentic behavior, knowledge, safety properties) and produce visualizations that make results legible to researchers and decision-makers
  • Build and harden the distributed eval execution platform so hundreds of evals run reliably against checkpoints during production RL training runs
  • Own dashboards researchers and leadership use to monitor model health during training; improve signal-to-noise, reduce latency, and make regressions obvious
  • Debug anomalous eval results mid-training-run, determine whether causes are model changes or infrastructure issues, and communicate clearly under time pressure
  • Improve tooling, libraries, and workflows researchers use to implement and iterate on evaluations
  • Partner with research teams across the full lifecycle of a new capability — from defining measurements to interpreting results as training progresses
  • Run experiments to characterize how prompting, sampling, and scaffolding choices affect results on internal and industry benchmarks
  • Communicate evaluations and their results to internal stakeholders and, where appropriate, external audiences

Minimum qualifications

  • Strong Python programming skills, including experience with production or research infrastructure
  • Experience building or operating distributed systems, data pipelines, or other infrastructure that needs to be reliable at scale
  • Clear written and verbal communication, especially when explaining technical results to non-specialists
  • Comfort operating in an on-call or production-support capacity when training runs are live
  • Care about the societal impacts of your work and an interest in steering powerful AI to be safe and beneficial

Preferred qualifications

  • Hands-on experience using large language models (e.g., Claude), including prompting, sampling, and scaffolding
  • Background in data visualization and a track record of building dashboards people trust and use
  • Experience developing robust evaluation metrics for language models
  • Experience with observability, monitoring, or experiment-tracking systems
  • Background in statistics and experimental design
  • Experience with large-scale dataset sourcing, curation, and processing
  • Experience running or supporting ML training infrastructure
  • Bias toward picking up slack and operating flexibly across team boundaries; enjoyment of pair programming

Representative projects

  • Stand up a new eval from scratch: define the task, build the dataset, implement scoring, validate against known signals, and ship a dashboard
  • Diagnose a mid-training regression and determine within hours whether it’s the model, harness, data, or infrastructure
  • Make a flaky distributed eval pipeline reliable: better retries, observability, and faster feedback
  • Partner with a research team to translate what "good" looks like into measurable artifacts

Compensation

Annual Salary: $320,000 - $485,000 USD

Logistics

  • Minimum education: Bachelor’s degree or equivalent experience
  • Location-based hybrid policy: staff are expected to be in one of the offices at least 25% of the time (some roles may require more)
  • Visa sponsorship: Anthropic states that it sponsors visas and retains an immigration lawyer to assist with the visa process

Benefits

Anthropic offers competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space for collaboration.

How we're different

Anthropic focuses on large-scale, high-impact AI research and values collaboration and strong communication. The team views AI research as an empirical science and hosts frequent research discussions to keep its efforts directed at the most impactful work.