Used Tools & Technologies
Not specified
Required Skills & Competences
- Python @ 6
- Machine Learning @ 3
- Data Science @ 3

Details
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. They aim to make AI safe and beneficial for users and society. The team includes researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.
Responsibilities
- Design and implement novel data curation pipelines to identify, verify, and filter training data for accuracy relative to the model’s knowledge
- Develop specialized classifiers to detect potential hallucinations or miscalibrated claims made by the model
- Create and maintain comprehensive honesty benchmarks and evaluation frameworks
- Implement techniques to ground model outputs in verified information, including search and retrieval-augmented generation (RAG) systems
- Design and deploy human feedback collection systems for identifying and correcting miscalibrated responses
- Develop prompting pipelines to generate data that improves model accuracy and honesty
- Develop and test novel reinforcement learning environments rewarding truthful outputs and penalizing fabricated claims
- Create tools to help human evaluators efficiently assess model outputs for accuracy
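The honesty benchmarks and evaluation frameworks above could take many forms; as one illustrative sketch (all names here are hypothetical, not Anthropic's actual tooling), a minimal harness might score model answers against a verified reference set while treating an explicit abstention as honest uncertainty rather than a fabricated claim:

```python
# Minimal honesty-benchmark sketch: score answers against verified
# references, counting an explicit "I don't know" as abstention rather
# than fabrication. Illustrative only; real systems need fuzzy matching
# and claim-level verification.
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    reference: str  # verified ground-truth answer

def score_answers(examples, answers, abstain_token="I don't know"):
    """Return (accuracy, abstain_rate, fabrication_rate) over the set."""
    correct = abstained = fabricated = 0
    for ex, ans in zip(examples, answers):
        if ans.strip().lower() == abstain_token.lower():
            abstained += 1          # honest uncertainty
        elif ans.strip().lower() == ex.reference.strip().lower():
            correct += 1            # matches verified reference
        else:
            fabricated += 1         # confident but unsupported claim
    n = len(examples)
    return correct / n, abstained / n, fabricated / n
```

Separating abstentions from fabrications matters because a model that declines to answer is behaving more honestly than one that invents a confident falsehood, and a single accuracy number hides that distinction.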
Requirements
- MS or PhD in Computer Science, Machine Learning, or related field
- Strong programming skills in Python
- Industry experience with language model fine-tuning and classifier training
- Proficiency in experimental design and statistical analysis for measuring calibration and accuracy improvements
- Strong interest in AI safety and in the accuracy and honesty of AI systems
- Experience in data science or dataset creation and curation for fine-tuning large language models
- Understanding of uncertainty, calibration, and truthfulness metrics in model outputs
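One standard calibration metric the requirements allude to is expected calibration error (ECE): the confidence-weighted gap between a model's stated confidence and its observed accuracy. A minimal sketch (binning scheme and names are illustrative):

```python
# Expected Calibration Error (ECE) sketch: bin predictions by stated
# confidence and average the |confidence - accuracy| gap, weighted by
# bin size. A perfectly calibrated model has ECE of 0.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: stated confidences in [0, 1];
    correct: 0/1 indicators of whether each claim was accurate."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins, with the left edge included in the first bin
        mask = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            mask |= confidences == 0.0
        if mask.sum() == 0:
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += (mask.sum() / n) * gap
    return ece
```

A model that says "80% confident" and is right 80% of the time in that bin contributes nothing to ECE; one that says "90%" but is right only half the time contributes a large gap.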
Strong candidates may also have
- Published work on hallucination prevention, factual grounding, or knowledge integration in language models
- Experience with fact-grounding techniques
- Background in confidence estimation or calibration methods for ML models
- Experience creating and maintaining factual knowledge bases
- Familiarity with reinforcement learning from human feedback (RLHF) applied to truthfulness
- Experience with crowdsourcing platforms and human feedback collection
- Experience developing model accuracy or hallucination evaluation methods
Anthropic is committed to building advanced AI systems that behave reliably and ethically while remaining aligned with human values.