Research Scientist, Interpretability

at Anthropic

📍 San Francisco, United States

USD 315,000-560,000 per year

MIDDLE

✅ Hybrid


Used Tools & Technologies

Not specified

Required Skills & Competences

Python @ 2
Algorithms @ 3
Data Analysis @ 3

Details

Anthropic’s Interpretability team is seeking researchers and engineers to reverse-engineer how language models work, with a focus on mechanistic interpretability — discovering how neural network parameters implement meaningful algorithms. The team combines experimental research, engineering, and collaboration across Anthropic to build tools, run experiments at scale, and produce mechanistic accounts of model behavior.

Responsibilities

Develop methods to understand large language models by reverse-engineering the algorithms learned in their weights
Design and run robust experiments, both quickly in toy scenarios and at scale in large models
Create and analyze interpretability features and circuits to understand model computation
Build infrastructure for running experiments and visualizing results
Communicate results clearly with colleagues and publicly (writing up findings, preparing visualizations and documentation)

Requirements

Strong track record of scientific research (in any field); some prior work on interpretability is expected
Familiarity with Python is required
Experience designing and running experiments and analyzing results (toy-scale and large-scale)
Comfortable with messy, exploratory experimental science and collaborative team research
Ability to write code, build experiment infrastructure, and perform data analysis and visualization
Ability to communicate research results clearly in writing and presentations; applicants are asked to provide publications or other public research outputs
Education: at least a Bachelor's degree in a related field or equivalent experience

Role location & policy

Role is based in the San Francisco office (San Francisco, CA). Anthropic is open to considering exceptional candidates for remote work on a case-by-case basis.
Staff are currently expected to be in one of Anthropic's offices at least 25% of the time (location-based hybrid policy).
Visa sponsorship: Anthropic sponsors visas in many cases and retains an immigration lawyer to assist once an offer is made.

Compensation

Expected base annual salary range: $315,000 - $560,000 USD (total compensation may include equity, benefits, and incentive compensation)

Nice-to-have / team fit

Interest in mechanistic interpretability, circuits, and transformer analysis
Enjoys collaborative, team-focused science and frequent research discussions
Willingness to write up and share findings publicly, including null results

How to apply / logistics

Applicants are asked to provide publications/research outputs (e.g., Google Scholar, Semantic Scholar). If you do not have publications, consider applying to a Research Engineer role instead.
Anthropic encourages applications from candidates who may not meet every listed qualification and values diverse perspectives.