Senior Research Scientist, Reward Models

USD 340,000-425,000 per year
SENIOR
✅ Hybrid
✅ Visa Sponsorship


Used Tools & Technologies

Not specified

Required Skills & Competences

  • Python @ 4
  • Machine Learning @ 4
  • Communication @ 7
  • LLM @ 4

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. This role focuses on pushing the frontier of reward modeling for large language models. As a Senior Research Scientist on the Reward Models team, you will lead research to specify and learn human preferences at scale, develop novel architectures and training methodologies for RLHF, research new approaches to LLM-based evaluation and grading (including rubric-based methods), and investigate techniques to detect and mitigate reward hacking. You will also collaborate across teams to translate research into production improvements and mentor other researchers.
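For concreteness, reward models for RLHF are most commonly trained on pairwise human preference comparisons with a Bradley-Terry-style loss. The sketch below is purely illustrative (it assumes a scalar-output reward model and uses PyTorch; none of it describes Anthropic's internal setup), with random tensors standing in for model scores:

    import torch
    import torch.nn.functional as F

    def pairwise_preference_loss(r_chosen, r_rejected):
        """Bradley-Terry pairwise loss over scalar rewards.

        r_chosen / r_rejected: shape (batch,) rewards assigned to the
        human-preferred and human-dispreferred response in each comparison.
        Minimizing -log(sigmoid(r_chosen - r_rejected)) trains the model to
        score preferred responses above dispreferred ones.
        """
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    # Illustrative usage: random scores stand in for reward-model outputs.
    r_c = torch.randn(8, requires_grad=True)
    r_r = torch.randn(8, requires_grad=True)
    loss = pairwise_preference_loss(r_c, r_r)
    loss.backward()
    print(f"loss = {loss.item():.4f}")

Reward hacking, in this framing, is a policy finding outputs that score highly under the learned proxy without actually satisfying the underlying human preference; detecting and mitigating that failure mode is a core focus of the role.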

Responsibilities

  • Lead research on novel reward model architectures and training approaches for RLHF
  • Develop and evaluate LLM-based grading and evaluation methods, including rubric-driven approaches that improve consistency and interpretability (see the sketch after this list)
  • Research techniques to detect, characterize, and mitigate reward hacking and specification gaming
  • Design experiments to understand reward model generalization, robustness, and failure modes
  • Collaborate with the Finetuning team to translate research insights into improvements for production training pipelines
  • Contribute to research publications, blog posts, and internal documentation
  • Mentor other researchers and help build institutional knowledge around reward modeling
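To make the rubric-driven grading bullet concrete, here is a minimal, hypothetical sketch of the general LLM-as-judge pattern (the rubric text, criteria, and the call_model hook are invented for illustration and describe no specific Anthropic system): ask a grader model to score each criterion separately and return structured output.

    import json
    from typing import Callable

    # Hypothetical hook: any LLM client mapping a prompt string to a
    # completion string could be plugged in here.
    ModelFn = Callable[[str], str]

    RUBRIC = """Score the response on each criterion from 1 (poor) to 5 (excellent):
    - factual_accuracy: claims are correct and verifiable
    - helpfulness: directly addresses the user's request
    - clarity: well organized and easy to follow
    Return only JSON: {"factual_accuracy": ..., "helpfulness": ..., "clarity": ...}"""

    def grade_with_rubric(prompt: str, response: str, call_model: ModelFn) -> dict:
        """Grade one response against an explicit rubric via an LLM judge."""
        grader_prompt = (
            f"{RUBRIC}\n\n"
            f"User prompt:\n{prompt}\n\n"
            f"Response to grade:\n{response}"
        )
        raw = call_model(grader_prompt)
        # Production code would validate keys and ranges and handle parse
        # errors; judge calibration and reliability are open research problems.
        return json.loads(raw)

Per-criterion scores make grader disagreements diagnosable (which criterion diverged?) rather than opaque, which is one reason rubric-based methods can improve consistency and interpretability.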

Requirements

  • Track record of research contributions in reward modeling, RLHF, or closely related areas of machine learning
  • Experience training and evaluating reward models for large language models
  • Comfortable designing and running large-scale experiments with significant computational resources
  • Ability to work effectively across research and engineering, iterating quickly while maintaining scientific rigor
  • Strong communication skills and collaborative research experience
  • Education: at least a Bachelor's degree in a related field or equivalent experience
  • Note: interviews for this role are conducted in Python

Desirable qualifications

  • Published research on reward modeling, preference learning, or RLHF
  • Experience with LLM-as-judge approaches, including calibration and reliability challenges
  • Experience with reward hacking, specification gaming, or related robustness problems
  • Experience with constitutional AI, debate, or other scalable oversight approaches
  • Experience contributing to production ML systems at scale
  • Familiarity with interpretability techniques as applied to reward model behavior

Compensation & Benefits

  • Annual base salary range: $340,000 - $425,000 USD
  • Total compensation for full-time employees includes equity and benefits, and may include incentive compensation
  • Additional benefits include optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space

Logistics

  • Location-based hybrid policy: staff are expected to be in one of Anthropic’s offices at least ~25% of the time; the role is remote-friendly but requires some office presence
  • Visa sponsorship: Anthropic sponsors visas and retains an immigration lawyer to assist once an offer is made

About the team and work

  • Work closely with Finetuning, Alignment Science, and the broader research organization
  • Access to frontier models and significant computational resources to run large-scale experiments
  • Emphasis on research that both advances scientific understanding and leads to practical production improvements