Research Engineer / Scientist, Alignment Science

USD 315,000-340,000 per year
Mid-level
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

Kubernetes (level 3), Python (level 3), Machine Learning (level 3), NLP (level 3), LLM (level 3)

Details

You will build and run machine learning experiments to help understand and steer the behavior of powerful AI systems. The role combines scientific and engineering work focused on AI safety and alignment, addressing risks from highly capable future systems. Interviews are conducted in Python, and Anthropic prefers candidates based in the Bay Area.

Responsibilities

  • Design, implement, and run elegant, thorough ML experiments to probe model behavior and alignment.
  • Contribute to exploratory experimental research on AI safety, often in collaboration with the Interpretability, Fine-Tuning, and Frontier Red Team groups.
  • Test the robustness of safety techniques by training language models to subvert them and measuring how well the techniques hold up.
  • Run multi-agent reinforcement learning experiments (e.g., AI Debate) to evaluate such techniques.
  • Build tooling to evaluate the effectiveness of novel LLM-generated jailbreaks, and produce evaluation datasets via scripting and prompts.
  • Write scripts and prompts to efficiently generate evaluation questions testing models' reasoning in safety-relevant contexts.
  • Contribute to research papers, blog posts, figures, and talks.
  • Run experiments that feed into Anthropic's efforts such as the Responsible Scaling Policy and pre-deployment alignment and welfare assessments.

Requirements

  • Significant software, machine learning, or research engineering experience.
  • Some experience contributing to empirical AI research projects.
  • Some familiarity with technical AI safety research.
  • Comfortable with fast-moving collaborative projects and taking ownership beyond a narrow job description.
  • Preference for candidates based in the Bay Area; interviews are conducted in Python.
  • Education: at least a Bachelor's degree in a related field or equivalent experience is required.

Strong candidates may also have:

  • Experience authoring research papers in machine learning, NLP, or AI safety.
  • Experience with large language models (LLMs) and fine-tuning.
  • Experience with reinforcement learning and multi-agent RL experiments.
  • Experience with Kubernetes clusters and complex shared codebases.

Candidates need not have 100% of the skills listed; formal certifications or degrees are not strictly required.

Benefits & Compensation

  • Annual base salary: $315,000 - $340,000 USD.
  • Total compensation includes equity and benefits, and may include incentive compensation.
  • Anthropic offers competitive compensation, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a Bay Area office environment.

Logistics

  • Location-based hybrid policy: staff are expected to be in one of Anthropic's offices at least 25% of the time (role-specific requirements may vary).
  • Visa sponsorship: Anthropic does sponsor visas where feasible and retains an immigration lawyer to assist when offers are made.
  • The team values diversity and encourages applicants from underrepresented groups.

About the Team & Research Focus

The Alignment Science team studies topics including scalable oversight, AI control, alignment stress-testing, automated alignment research, alignment assessments, safeguards research, and model welfare. Representative projects include robustness testing of safety techniques, multi-agent RL experiments, tooling for jailbreak evaluation, prompt and script development for evaluations, and contribution to policy and assessment efforts.

How to Apply

The listing includes standard application fields (contact info, resume/CV, links to work, team preferences, and visa/relocation questions). Anthropic provides guidance on candidate AI usage during the application process.