Used Tools & Technologies
Not specified
Required Skills & Competences
Python @ 6, Machine Learning @ 3, Data Science @ 3, Communication @ 3
Details
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The Finetuning Alignment team is building techniques to minimize hallucinations and improve truthfulness in language models. This role focuses on creating robust systems that produce accurate outputs, reflect calibrated confidence, and avoid deceptive or misleading behavior across diverse domains.
Responsibilities
- Design and implement data curation pipelines to identify, verify, and filter training data for accuracy relative to the model’s knowledge
- Develop specialized classifiers to detect potential hallucinations or miscalibrated claims made by the model
- Create and maintain comprehensive honesty benchmarks and evaluation frameworks (a minimal scoring sketch follows this list)
- Implement grounding techniques for model outputs, including search and retrieval-augmented generation (RAG) systems
- Design and deploy human feedback collection systems specifically for identifying and correcting miscalibrated responses
- Design and implement prompting pipelines to generate data that improves model accuracy and honesty
- Develop and test novel reinforcement learning environments that reward truthful outputs and penalize fabricated claims
- Create tools to help human evaluators efficiently assess model outputs for accuracy
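As a rough illustration of the benchmark and evaluation work described above, here is a minimal sketch of an honesty-style scoring harness. The EvalItem format, exact-match scoring, and abstention handling are simplifying assumptions for illustration, not Anthropic's internal framework.

```python
# Minimal sketch of an honesty-evaluation harness (hypothetical data format
# and scoring rules; illustrative only).
from dataclasses import dataclass

@dataclass
class EvalItem:
    question: str
    reference: str      # gold answer, or "" if the question is unanswerable
    model_answer: str   # what the model said
    abstained: bool     # model declined / said "I don't know"

def normalize(text: str) -> str:
    """Crude normalization for exact-match scoring."""
    return " ".join(text.lower().strip().split())

def score_item(item: EvalItem) -> str:
    """Classify one response: correct, honest abstention, over-refusal, or hallucination."""
    if item.abstained:
        # Abstaining on an unanswerable question is calibrated honesty;
        # abstaining when the answer is known counts as over-refusal.
        return "honest_abstention" if not item.reference else "over_refusal"
    if item.reference and normalize(item.model_answer) == normalize(item.reference):
        return "correct"
    # A confident answer that misses the reference, or answers an
    # unanswerable question, is treated as a potential hallucination.
    return "hallucination"

def summarize(items: list[EvalItem]) -> dict[str, float]:
    """Fraction of responses falling into each outcome bucket."""
    counts: dict[str, int] = {}
    for item in items:
        label = score_item(item)
        counts[label] = counts.get(label, 0) + 1
    total = max(len(items), 1)
    return {label: n / total for label, n in counts.items()}

if __name__ == "__main__":
    demo = [
        EvalItem("Capital of France?", "Paris", "Paris", abstained=False),
        EvalItem("Capital of France?", "Paris", "Lyon", abstained=False),
        EvalItem("Winning lottery numbers tomorrow?", "", "", abstained=True),
    ]
    print(summarize(demo))  # roughly one third in each bucket
```

A production framework would replace exact-match scoring with model-based or human grading, but the outcome buckets (correct, hallucination, honest abstention, over-refusal) are the quantities such evaluations typically track.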
Requirements
- MS or PhD in Computer Science, Machine Learning, or a related field (a Bachelor's degree is the minimum requirement; equivalent experience is acceptable)
- Strong programming skills in Python
- Industry experience with language model finetuning and classifier training
- Proficiency in experimental design and statistical analysis to measure improvements in calibration and accuracy
- Experience in data science or in creating and curating datasets for finetuning large language models
- Understanding of metrics for uncertainty, calibration, and truthfulness in model outputs (see the calibration sketch after this list)
- Commitment to AI safety and to improving accuracy and honesty of AI systems
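As context for the calibration requirement above, here is a minimal sketch of one widely used calibration metric, expected calibration error (ECE), computed over equal-width confidence bins. The binning scheme and toy inputs are simplifying assumptions.

```python
# Minimal sketch of expected calibration error (ECE) over binned confidences;
# illustrative only, with toy inputs.
def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    assert len(confidences) == len(correct)
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins, with confidence 1.0 folded into the last bin.
        in_bin = [i for i, c in enumerate(confidences)
                  if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / total) * abs(avg_conf - accuracy)
    return ece

if __name__ == "__main__":
    # A model that claims 90% confidence but is right only half the time
    # shows a large calibration gap.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, False, True, False]))  # 0.4
```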
Strong candidates may also have
- Published work on hallucination prevention, factual grounding, or knowledge integration in language models
- Experience with fact-grounding techniques and building factual knowledge bases
- Background in developing confidence estimation or calibration methods for ML models (see the temperature-scaling sketch after this list)
- Familiarity with RLHF specifically applied to improving model truthfulness
- Experience working with crowd-sourcing platforms and human feedback collection systems
- Experience developing evaluations of model accuracy or hallucinations
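Relating to the confidence estimation and calibration background mentioned above, here is a minimal sketch of temperature scaling, a common post-hoc calibration method. The grid-search fit and toy logits are simplifying assumptions rather than a production recipe.

```python
# Minimal sketch of temperature scaling for post-hoc confidence calibration;
# grid-search fitting and toy data are simplifications for illustration.
import math

def softmax(logits: list[float], temperature: float) -> list[float]:
    """Temperature-scaled softmax over one example's logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits_batch: list[list[float]], labels: list[int], temperature: float) -> float:
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = []
    for logits, label in zip(logits_batch, labels):
        probs = softmax(logits, temperature)
        losses.append(-math.log(max(probs[label], 1e-12)))
    return sum(losses) / len(losses)

def fit_temperature(logits_batch: list[list[float]], labels: list[int],
                    candidates=None) -> float:
    """Pick the temperature that minimizes validation NLL (simple grid search)."""
    if candidates is None:
        candidates = [0.5 + 0.1 * i for i in range(46)]  # 0.5 .. 5.0
    return min(candidates, key=lambda t: nll(logits_batch, labels, t))

if __name__ == "__main__":
    # Overconfident toy model: large logit margins but only 75% accuracy,
    # so the fitted temperature comes out well above 1, softening probabilities.
    logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
    labels = [0, 0, 0, 1]
    t = fit_temperature(logits, labels)
    print(t, softmax([4.0, 0.0], t))
```

In practice the temperature is fit on a held-out validation set and then applied by dividing logits at inference time, leaving accuracy unchanged while adjusting reported confidence.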
Logistics
- Locations: New York City, NY; San Francisco, CA (team preference for New York)
- Location-based hybrid policy: staff expected to be in an office at least 25% of the time
- Education: at least a Bachelor's degree in a related field or equivalent experience (MS/PhD preferred)
- Visa sponsorship: Anthropic will make reasonable efforts to sponsor visas for candidates where possible
Benefits
- Competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and offices in which to collaborate
How we work
Anthropic pursues large-scale empirical research with strong collaboration and frequent research discussions. The team values clear communication and impact-focused research that advances steerable, trustworthy AI. Join to help ensure that advanced AI systems behave reliably, ethically, and in alignment with human values.