Software Engineer, Safeguards Evals

at Anthropic

📍 New York City, United States
📍 San Francisco, United States

USD 320,000-485,000 per year

MIDDLE

✅ Hybrid

✅ Visa Sponsorship

Used Tools & Technologies

Machine Learning

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Python @ 5, Distributed Systems @ 3, Communication @ 3, Data Analysis @ 6, LLM @ 3, AI @ 3, Data Pipelines @ 3, Prompt Engineering @ 3

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. This role sits at the intersection of applied ML research and engineering, building the evaluation infrastructure that measures whether investigative and monitoring agents reliably catch misuse of Claude. Work includes designing experiments, building datasets that reflect real abuse, shipping evaluation methods into pipelines that gate system changes, and constructing RL environments to improve investigation capabilities.

Responsibilities

Build and own the evaluation harness for an agentic investigation system — define metrics, test cases, and grading approaches for a complex long-horizon agent.
Construct high-quality evaluation datasets representing real-world misuse across harm areas (e.g., cyber attacks, bioweapons, influence operations), drawing from real traffic patterns and synthetic generation.
Measure agent performance end-to-end (detection precision/recall, investigation quality, robustness) and drive hill-climbing on the hardest harm areas.
Analyze coverage to identify measurement gaps and evolve evaluations so they remain unsaturated and high-signal as agent capabilities advance.
Productionize successful research into regression and release pipelines that run on every agent change, prompt update, and underlying model upgrade.
Build tooling that enables policy experts to author, run, and iterate on evaluations without engineering support.
Construct RL environments to improve Claude’s safety investigation capabilities.
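
As a rough illustration of the detection precision/recall measurement and the regression gating described above, here is a minimal Python sketch. It is not Anthropic's harness: the `Case` record, the harm-area labels, the `gate` function, and the 0.02 recall tolerance are all hypothetical names and values chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One graded test case (hypothetical schema for illustration)."""
    harm_area: str   # e.g. "cyber" or "influence_ops" (assumed labels)
    is_misuse: bool  # ground-truth label
    flagged: bool    # whether the agent flagged the case


def precision_recall(cases):
    """Return (precision, recall); a metric is 0.0 when its denominator is zero."""
    tp = sum(c.is_misuse and c.flagged for c in cases)
    fp = sum((not c.is_misuse) and c.flagged for c in cases)
    fn = sum(c.is_misuse and (not c.flagged) for c in cases)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def per_area_metrics(cases):
    """Group cases by harm area and compute (precision, recall) for each."""
    by_area = defaultdict(list)
    for case in cases:
        by_area[case.harm_area].append(case)
    return {area: precision_recall(group) for area, group in by_area.items()}


def gate(baseline, candidate, tolerance=0.02):
    """List harm areas whose recall fell by more than `tolerance` (assumed threshold).

    An area missing from `candidate` is treated as recall 0.0.
    """
    return sorted(
        area
        for area, (_, base_recall) in baseline.items()
        if candidate.get(area, (0.0, 0.0))[1] < base_recall - tolerance
    )
```

A release pipeline could compute `per_area_metrics` for the current and proposed agent versions and block the change whenever `gate` returns a non-empty list. Reporting per harm area rather than in aggregate keeps a regression in one hard area from being masked by strong results elsewhere.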

Minimum qualifications

Proficiency in Python and comfort working across the stack.
Experience building and maintaining data pipelines.
Experience working with LLMs and a working understanding of their capabilities and failure modes — especially agentic systems with tool use and multi-step reasoning.
Strong data analysis skills — ability to draw reliable insights from large datasets.
Ability to move fluidly between research prototyping and production-quality code.
Ability to translate ambiguous problems into concrete, testable experiments.

Preferred qualifications

6+ years of industry software engineering experience.
Expertise in building or contributing to agent evaluation frameworks, benchmarks, or automated grading systems.
Extensive experience in trust and safety, content moderation, or abuse detection systems.
Experience in red teaming, adversarial testing, or jailbreak research on AI systems.
Experience with synthetic data generation or data augmentation.
Experience with distributed systems or large-scale data processing.
Experience with prompt engineering or building LLM-powered applications.

Compensation

Annual Salary: $320,000 - $485,000 USD

Logistics

Minimum education: Bachelor’s degree or equivalent combination of education, training, and/or experience.
Location-based hybrid policy: staff are expected to be in one of Anthropic's offices at least 25% of the time; some roles may require more office time.
Visa sponsorship: Anthropic states that it sponsors visas and retains an immigration lawyer to help where possible.

Benefits

Competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space for collaboration.

How we're different

Anthropic emphasizes large-scale, collaborative research on steerable, trustworthy AI, and it values communication and cross-disciplinary collaboration. The research directions it cites include prior work on GPT-3, interpretability, scaling laws, and learning from human preferences.