Used Tools & Technologies
Machine Learning, LLM
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Python @ 6
Statistics @ 3
Distributed Systems @ 3
Leadership @ 3
Communication @ 3
Slack @ 3
Observability @ 3
AI @ 3
Data Visualization @ 3
Data Pipelines @ 3
Details
About Anthropic
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The team is a growing group of researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.
Role overview
You will build evaluations that measure what Claude can actually do, turning ambiguous notions of "intelligence" into clear, defensible metrics. Work spans designing and implementing evaluations across capabilities and personality, and building infrastructure to run those evaluations reliably at scale. You will partner closely with researchers through the lifecycle of new capabilities — from defining what to measure to running evals against live training checkpoints and interpreting results.
Responsibilities
- Design and run new evaluations of Claude's capabilities (reasoning, agentic behavior, knowledge, safety properties) and produce visualizations that make results legible to researchers and decision-makers
- Build and harden the distributed eval execution platform so hundreds of evals run reliably against checkpoints during production RL training runs
- Own dashboards researchers and leadership use to monitor model health during training; improve signal-to-noise, reduce latency, and make regressions obvious
- Debug anomalous eval results mid-training-run, determine whether causes are model changes or infrastructure issues, and communicate clearly under time pressure
- Improve tooling, libraries, and workflows researchers use to implement and iterate on evaluations
- Partner with research teams across the full lifecycle of a new capability — from defining measurements to interpreting results as training progresses
- Run experiments to characterize how prompting, sampling, and scaffolding choices affect results on internal and industry benchmarks
- Communicate evaluations and their results to internal stakeholders and, where appropriate, external audiences
Minimum qualifications
- Strong Python programming skills, including experience with production or research infrastructure
- Experience building or operating distributed systems, data pipelines, or other infrastructure that needs to be reliable at scale
- Clear written and verbal communication, especially when explaining technical results to non-specialists
- Comfort operating in an on-call or production-support capacity when training runs are live
- Care about the societal impacts of your work and an interest in steering powerful AI to be safe and beneficial
Preferred qualifications
- Hands-on experience using large language models (e.g., Claude), including prompting, sampling, and scaffolding
- Background in data visualization and a track record of building dashboards people trust and use
- Experience developing robust evaluation metrics for language models
- Experience with observability, monitoring, or experiment-tracking systems
- Background in statistics and experimental design
- Experience with large-scale dataset sourcing, curation, and processing
- Experience running or supporting ML training infrastructure
- Bias toward picking up slack and operating flexibly across team boundaries; enjoyment of pair programming
Representative projects
- Stand up a new eval from scratch: define the task, build the dataset, implement scoring, validate against known signals, and ship a dashboard
- Diagnose a mid-training regression and determine within hours whether it’s the model, harness, data, or infrastructure
- Make a flaky distributed eval pipeline reliable: better retries, observability, and faster feedback
- Partner with a research team to translate what "good" looks like into measurable artifacts
Compensation
Annual Salary: $320,000 - $485,000 USD
Logistics
- Minimum education: Bachelor’s degree or equivalent experience
- Location-based hybrid policy: staff are expected to be in one of the offices at least 25% of the time (some roles may require more)
- Visa sponsorship: Anthropic states that it sponsors visas and retains an immigration lawyer to assist with the visa process
Benefits
Anthropic offers competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space for collaboration.
How we're different
Anthropic focuses on large-scale, high-impact AI research and values collaboration and strong communication. The team views AI research as an empirical science and hosts frequent research discussions to pursue high-impact work.