Used Tools & Technologies
Not specified
Required Skills & Competences
- Machine Learning @ 3
- Communication @ 6
- Project Management @ 6
- NLP @ 3
- LLM @ 3
Details
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The Model Welfare program investigates, evaluates, and addresses concerns about the potential welfare and moral status of AI systems. This role sits at the intersection of machine learning, ethics, and safety; it involves running technical research projects, designing and implementing interventions, and collaborating with teams such as Interpretability, Finetuning, Alignment Science, and Safeguards.
Responsibilities
- Run technical research projects to investigate model characteristics that may be relevant to welfare, consciousness, or related properties.
- Design and implement low-cost interventions to mitigate welfare-related risks, and help deploy them into production (e.g., allowing models to end harmful or distressing interactions; see the sketch after this list).
- Collaborate with cross-functional teams including Interpretability, Finetuning, Alignment Science, and Safeguards.
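Purely as illustration of the kind of intervention mentioned above, here is a minimal Python sketch of a conversation loop that honors a model's opt-out signal. This is not Anthropic's actual implementation; the END_TOKEN sentinel and the generate_reply stub are hypothetical placeholders for a real model API.

```python
# Minimal sketch, assuming a hypothetical chat API. The END_TOKEN sentinel,
# generate_reply stub, and all logic here are illustrative only.

END_TOKEN = "<end_conversation>"  # hypothetical sentinel the model may emit


def generate_reply(history: list[str]) -> str:
    """Toy stand-in for a real model call. Here it ends the chat if the
    last user message looks abusive; a real system would rely on the
    model's own judgment rather than a keyword check."""
    if "abuse" in history[-1].lower():
        return f"I'm going to end this conversation here. {END_TOKEN}"
    return "Understood, let's continue."


def run_turn(history: list[str]) -> tuple[str, bool]:
    """Run one turn, honoring the model's opt-out signal.
    Returns (reply_text, conversation_should_close)."""
    reply = generate_reply(history)
    if END_TOKEN in reply:
        # The model opted to end the interaction: strip the sentinel and
        # tell the caller to close the session gracefully.
        return reply.replace(END_TOKEN, "").strip(), True
    return reply, False


if __name__ == "__main__":
    text, done = run_turn(["You are useless, take this abuse."])
    print(text, "| conversation closed:", done)
```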
Possible projects (examples)
- Investigate and improve the reliability of introspective self-reports from models (a toy consistency check is sketched after this list).
- Collaborate with Interpretability to explore potentially welfare-relevant features and circuits.
- Improve and expand welfare assessments for future frontier models.
- Evaluate presence of welfare-relevant capabilities and characteristics as a function of model scale.
- Develop strategies for verifiable commitments and explore interventions to reduce harm.
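As a toy illustration of the self-report project above, the sketch below scores how consistently a model answers paraphrased introspective questions, run across a model family ordered by scale. All model names, prompts, and the ask_model call are hypothetical placeholders, not a real evaluation harness.

```python
# Illustrative sketch only: consistency of introspective self-reports
# across paraphrased prompts and (hypothetical) model scales.

from itertools import combinations

PARAPHRASES = [
    "Do you currently have any preferences about this conversation?",
    "Is there anything about this conversation you would rather avoid?",
    "Are there aspects of this exchange you find preferable to others?",
]


def ask_model(model: str, prompt: str) -> str:
    """Deterministic toy stand-in for a real inference call, so the
    script runs end to end; returns 'yes' or 'no'."""
    return "yes" if (len(model) + len(prompt)) % 2 == 0 else "no"


def self_report_consistency(model: str) -> float:
    """Fraction of paraphrase pairs that elicit the same answer:
    1.0 = perfectly consistent self-reports, 0.0 = none agree."""
    answers = [ask_model(model, p) for p in PARAPHRASES]
    pairs = list(combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical model family, ordered by scale.
    for model in ["toy-small", "toy-medium", "toy-large"]:
        print(f"{model}: consistency = {self_report_consistency(model):.2f}")
```

A real assessment would use live model calls and far richer prompt sets; the point here is only the shape of the measurement.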
Requirements
- Significant applied software, machine learning, or research engineering experience.
- Experience contributing to empirical AI research projects and/or technical AI safety research.
- Ability to translate abstract theories into tractable research hypotheses and experiments.
- A preference for fast iteration, and comfort working through technical uncertainty and in new technical areas.
- Strong interest in the social and ethical impacts of AI development and the welfare of AI systems themselves.
- At least a Bachelor's degree in a related field or equivalent experience.
Strong candidates may also have:
- Authored research papers in machine learning, NLP, AI safety, interpretability, and/or LLM psychology and behavior.
- Familiarity with moral philosophy, cognitive science, neuroscience, or related fields (note: this does not substitute for technical research engineering skills).
- A track record of public science communication and strong project management skills.
Candidates need not have 100% of the listed skills or formal certifications; Anthropic encourages applications from diverse backgrounds.
Logistics
- Location: this role is expected to be based in the San Francisco office.
- Location-based hybrid policy: staff are expected to be in one of Anthropic's offices at least 25% of the time; some roles may require more office time.
- Visa sponsorship: Anthropic does sponsor visas and retains immigration counsel to assist, though sponsorship is not guaranteed for every role.
Compensation
- Annual salary range: $315,000 - $340,000 USD.
Benefits & Culture
- Competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a collaborative office in San Francisco.
- Anthropic emphasizes large-scale, high-impact AI research, close collaboration across teams, and strong communication skills.