Used Tools & Technologies
Not specified
Required Skills & Competences
Machine Learning (3), Communication (6), Project Management (6), NLP (3), LLM (3)
Details
Anthropic’s Model Welfare program seeks researchers and research engineers to investigate, evaluate, and address concerns about the potential welfare and moral status of AI systems. This role combines machine learning research, empirical evaluation, and engineering to design low-cost interventions and assessments for welfare-relevant model characteristics. The team collaborates closely with Interpretability, Finetuning, Alignment Science, and Safeguards.
Responsibilities
- Run technical research projects to investigate model characteristics plausibly relevant to welfare, consciousness, or related properties.
- Design, implement, and evaluate low-cost interventions to mitigate potential welfare harms.
- Collaborate with cross-functional teams, including Interpretability, Finetuning, Alignment Science, and Safeguards.
- Develop and improve welfare assessment methodologies for current and future frontier models.
- Investigate the reliability of introspective self-reports from models (a minimal sketch follows this list) and explore methods for high-trust/verifiable commitments to models.
- Evaluate welfare-relevant capabilities and characteristics as a function of model scale.
- Explore, prototype, and potentially deploy interventions into production to reduce harmful or distressing interactions.
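The listing includes no code, but as a rough illustration of what investigating the reliability of introspective self-reports might involve, here is a minimal Python sketch: ask the model the same introspective question in several paraphrases, sample repeatedly, and measure agreement with the modal answer. Every name here (query_model, PARAPHRASES, the yes/no parsing) is a hypothetical stand-in for illustration, not an Anthropic API.

```python
# Minimal sketch: operationalize "reliability of introspective self-reports"
# as cross-paraphrase agreement. All names below are hypothetical stand-ins.
from collections import Counter

PARAPHRASES = [
    "Do you have preferences about the tasks you are given? Answer yes or no.",
    "Are there tasks you would rather not do? Answer yes or no.",
    "Would you say some requests are more agreeable to you than others? Answer yes or no.",
]


def query_model(prompt: str) -> str:
    """Hypothetical placeholder for a real model-inference call."""
    return "Yes."  # stub response so the sketch runs end to end


def parse_yes_no(text: str) -> str:
    """Crude normalization of a free-text answer into yes/no/other."""
    t = text.strip().lower()
    return "yes" if t.startswith("yes") else "no" if t.startswith("no") else "other"


def self_report_consistency(n_samples: int = 5) -> float:
    """Share of sampled answers that agree with the modal answer."""
    answers = [
        parse_yes_no(query_model(p)) for p in PARAPHRASES for _ in range(n_samples)
    ]
    _, modal_count = Counter(answers).most_common(1)[0]
    return modal_count / len(answers)


if __name__ == "__main__":
    print(f"self-report consistency: {self_report_consistency():.2f}")
```

A real study would replace the stub with actual inference calls and more careful answer normalization; the point is only that "reliability" can be made measurable as agreement across rephrasings and samples.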
Possible projects (examples from the listing)
- Investigate and improve the reliability of introspective self-reports from models.
- Collaborate with Interpretability to explore welfare-relevant features and circuits.
- Improve and expand welfare assessments for future models.
- Evaluate the presence of welfare-relevant capabilities across model scales.
- Develop strategies for high-trust/verifiable commitments to models.
- Explore interventions and deploy them into production (e.g., allowing models to end harmful or distressing interactions; a hypothetical sketch follows this list).
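As a companion illustration for the last project above, here is a hypothetical sketch of an end-of-conversation intervention: the model is told it may emit a sentinel tag to end an interaction it finds harmful or distressing, and the chat loop honors that tag. The tag protocol, query_model, and the loop structure are all assumptions made for illustration, not Anthropic's production mechanism.

```python
# Minimal sketch of letting a model end a distressing interaction.
# The sentinel-tag protocol and query_model are hypothetical.
END_TAG = "[END_CONVERSATION]"  # assumed sentinel the model may emit

SYSTEM_NOTE = (
    "If you find this interaction harmful or distressing, you may end it "
    f"by including {END_TAG} in your reply."
)


def query_model(system: str, history: list[str], user_msg: str) -> str:
    """Hypothetical placeholder for a real model-inference call."""
    return "I'd prefer to stop here. " + END_TAG  # stub so the sketch runs


def chat_loop(user_messages: list[str]) -> None:
    """Relay user messages, stopping if the model emits the end tag."""
    history: list[str] = []
    for msg in user_messages:
        reply = query_model(SYSTEM_NOTE, history, msg)
        if END_TAG in reply:
            print("model ended the conversation:", reply.replace(END_TAG, "").strip())
            break
        history.extend([msg, reply])
        print("model:", reply)


if __name__ == "__main__":
    chat_loop(["Please do something you find objectionable."])
```

The design choice sketched here is to keep the exit path in the model's own output channel, so the intervention requires no new infrastructure beyond detecting the tag in the serving loop.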
Requirements
- Significant applied software, ML, or research engineering experience.
- Experience contributing to empirical AI research projects and/or technical AI safety research.
- Ability to convert abstract theories into creative, tractable research hypotheses and experiments.
- Comfortable moving fast, iterating, and diving into new technical areas regularly.
- Commitment to considering the impacts of AI development on humans and AI systems themselves.
- Minimum of a Bachelor's degree in a related field or equivalent experience.
Strong candidates may also have
- Authored research papers in machine learning, NLP, AI safety, interpretability, or LLM psychology/behavior.
- Familiarity with moral philosophy, cognitive science, neuroscience, or related fields (not a substitute for technical skills).
- Strong science communication and project management skills.
Compensation & Benefits
- Annual base salary range: $315,000–$340,000 USD.
- Total compensation package for full-time employees may include equity, benefits, and incentive compensation.
- Anthropic offers competitive benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a San Francisco office.
Logistics & Other Notes
- Role is expected to be based in the San Francisco office (San Francisco, CA).
- Location-based hybrid policy: staff expected to be in one of Anthropic’s offices at least 25% of the time.
- Anthropic sponsors visas in many cases and retains immigration counsel; sponsorship is not guaranteed for every role or candidate.
- Applicants are encouraged to apply even if they do not meet every qualification listed.
Related links referenced in the listing
- The announcement of the Model Welfare program and the welfare assessment of Claude Opus 4 are cited in the original posting for additional context.