Research Engineer / Scientist, Alignment Science

USD 315,000-340,000 per year
MIDDLE
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

  • Kubernetes: 3
  • Machine Learning: 3
  • Communication: 6
  • NLP: 3
  • LLM: 3

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The Alignment Science team conducts empirical research and engineering to understand and steer the behavior of powerful AI systems, focusing on risks from future high-capability models. Team research areas include scalable oversight, AI control, alignment stress-testing, automated alignment research, alignment assessments, safeguards research, and model welfare.

Responsibilities

  • Design, build, and run rigorous machine learning experiments to evaluate safety, alignment, and robustness of large models.
  • Collaborate with Interpretability, Fine-Tuning, Frontier Red Team, and other teams to design and execute experiments.
  • Test the robustness of safety techniques by training models to subvert interventions and measuring their effectiveness.
  • Run multi-agent reinforcement learning experiments (e.g., AI Debate) and other simulation-based evaluations.
  • Build tooling and evaluation frameworks to efficiently assess jailbreaks, adversarial attacks, and other failure modes.
  • Write scripts, prompts, and evaluation data to probe models’ reasoning and safety in high-stakes contexts.
  • Contribute to writing, figures, and analysis for research papers, blog posts, and talks.
  • Perform alignment assessments and contribute to pre-deployment safety evaluations and misalignment-risk safety cases.

Requirements

  • Significant software engineering, machine learning, or research engineering experience.
  • Experience contributing to empirical AI research projects.
  • Familiarity with technical AI safety research and interest in making AI helpful, honest, and harmless.
  • Preference for fast-moving collaborative projects and willingness to work beyond strict job boundaries when needed.
  • Minimum: a Bachelor's degree in a related field or equivalent experience.

Strong candidates may also have:

  • Authored research papers in machine learning, NLP, or AI safety.
  • Experience with large language models (LLMs) and prompt/script engineering for evaluation.
  • Experience with reinforcement learning and multi-agent experiments.
  • Experience working with Kubernetes clusters and complex shared codebases.

Candidates need not have 100% of the listed skills or formal certifications; Anthropic encourages applications from diverse backgrounds.

Logistics

  • Location: San Francisco, CA (hybrid policy: staff expected in an office at least ~25% of the time; some roles may require more).
  • Visa sponsorship: Anthropic sponsors visas and, once an offer is made, will make reasonable efforts and provide immigration support, though sponsorship cannot be guaranteed for every role or candidate.
  • Education: At least a Bachelor’s degree in a related field or equivalent experience is required.

Benefits

  • Competitive compensation and benefits.
  • Optional equity donation matching.
  • Generous vacation and parental leave.
  • Flexible working hours and a collaborative office environment.
  • Clear guidance on candidates' use of AI during the application process.

Representative projects (examples)

  • Train language models to attempt to subvert safety techniques and evaluate the effectiveness of defenses.
  • Run multi-agent RL experiments to evaluate approaches like AI Debate.
  • Build tooling for evaluating LLM-generated jailbreaks and automated evaluation questions.
  • Run experiments that inform Anthropic’s Responsible Scaling Policy and other safety efforts.

About the team and culture

  • The group values big-science, high-impact empirical AI research that blends engineering and scientific rigor.
  • Frequent research discussions and strong emphasis on collaborative communication and impact-driven work.