Used Tools & Technologies
Not specified
Required Skills & Competences
Python @ 6, Machine Learning @ 3, Data Science @ 3, Communication @ 3
Details
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The Finetuning Alignment team is building techniques to minimize hallucinations and improve truthfulness in language models. This role focuses on creating robust systems that produce accurate outputs, reflect calibrated confidence, and avoid deceptive or misleading behavior across diverse domains.
Responsibilities
- Design and implement data curation pipelines to identify, verify, and filter training data for accuracy relative to the model’s knowledge
- Develop specialized classifiers to detect potential hallucinations or miscalibrated claims made by the model
- Create and maintain comprehensive honesty benchmarks and evaluation frameworks (a minimal scoring sketch follows this list)
- Implement grounding techniques for model outputs, including search and retrieval-augmented generation (RAG) systems
- Design and deploy human feedback collection systems specifically for identifying and correcting miscalibrated responses
- Design and implement prompting pipelines to generate data that improves model accuracy and honesty
- Develop and test novel reinforcement learning environments that reward truthful outputs and penalize fabricated claims
- Create tools to help human evaluators efficiently assess model outputs for accuracy
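As a rough illustration of the benchmark and evaluation work described above, here is a minimal sketch of an honesty-style scoring harness. The EvalItem format, exact-match scoring, and abstention handling are simplifying assumptions for illustration, not Anthropic's internal framework.

```python
# Minimal sketch of an honesty-evaluation harness (hypothetical data format
# and scoring rules; illustrative only).
from dataclasses import dataclass

@dataclass
class EvalItem:
    question: str
    reference: str      # gold answer, or "" if the question is unanswerable
    model_answer: str   # what the model said
    abstained: bool     # model declined / said "I don't know"

def normalize(text: str) -> str:
    """Crude normalization for exact-match scoring."""
    return " ".join(text.lower().strip().split())

def score_item(item: EvalItem) -> str:
    """Classify one response: correct, honest abstention, over-refusal, or hallucination."""
    if item.abstained:
        # Abstaining on an unanswerable question is calibrated honesty;
        # abstaining when the answer is known counts as over-refusal.
        return "honest_abstention" if not item.reference else "over_refusal"
    if item.reference and normalize(item.model_answer) == normalize(item.reference):
        return "correct"
    # A confident answer that misses the reference, or answers an
    # unanswerable question, is treated as a potential hallucination.
    return "hallucination"

def summarize(items: list[EvalItem]) -> dict[str, float]:
    """Fraction of responses falling into each outcome bucket."""
    counts: dict[str, int] = {}
    for item in items:
        label = score_item(item)
        counts[label] = counts.get(label, 0) + 1
    total = max(len(items), 1)
    return {label: n / total for label, n in counts.items()}

if __name__ == "__main__":
    demo = [
        EvalItem("Capital of France?", "Paris", "Paris", abstained=False),
        EvalItem("Capital of France?", "Paris", "Lyon", abstained=False),
        EvalItem("Winning lottery numbers tomorrow?", "", "", abstained=True),
    ]
    print(summarize(demo))  # roughly one third in each bucket
```

A production framework would replace exact-match scoring with model-based or human grading, but the outcome buckets (correct, hallucination, honest abstention, over-refusal) are the quantities such evaluations typically track.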
Requirements
- MS or PhD in Computer Science, Machine Learning, or a related field (a Bachelor's degree is the minimum requirement; equivalent experience is acceptable)
- Strong programming skills in Python
- Industry experience with language model finetuning and classifier training
- Proficiency in experimental design and statistical analysis to measure improvements in calibration and accuracy
- Experience in data science or in creating and curating datasets for finetuning large language models
- Understanding of metrics for uncertainty, calibration, and truthfulness in model outputs (see the calibration sketch after this list)
- Commitment to AI safety and to improving accuracy and honesty of AI systems
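As context for the calibration requirement above, here is a minimal sketch of one widely used calibration metric, expected calibration error (ECE), computed over equal-width confidence bins. The binning scheme and toy inputs are simplifying assumptions.

```python
# Minimal sketch of expected calibration error (ECE) over binned confidences;
# illustrative only, with toy inputs.
def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    assert len(confidences) == len(correct)
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins, with confidence 1.0 folded into the last bin.
        in_bin = [i for i, c in enumerate(confidences)
                  if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / total) * abs(avg_conf - accuracy)
    return ece

if __name__ == "__main__":
    # A model that claims 90% confidence but is right only half the time
    # shows a large calibration gap.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, False, True, False]))  # 0.4
```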
Strong candidates may also have
- Published work on hallucination prevention, factual grounding, or knowledge integration in language models
- Experience with fact-grounding techniques and building factual knowledge bases
- Background in developing confidence estimation or calibration methods for ML models (see the temperature-scaling sketch after this list)
- Familiarity with RLHF specifically applied to improving model truthfulness
- Experience working with crowd-sourcing platforms and human feedback collection systems
- Experience developing evaluations of model accuracy or hallucinations
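Relating to the confidence estimation and calibration background mentioned above, here is a minimal sketch of temperature scaling, a common post-hoc calibration method. The grid-search fit and toy logits are simplifying assumptions rather than a production recipe.

```python
# Minimal sketch of temperature scaling for post-hoc confidence calibration;
# grid-search fitting and toy data are simplifications for illustration.
import math

def softmax(logits: list[float], temperature: float) -> list[float]:
    """Temperature-scaled softmax over one example's logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits_batch: list[list[float]], labels: list[int], temperature: float) -> float:
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = []
    for logits, label in zip(logits_batch, labels):
        probs = softmax(logits, temperature)
        losses.append(-math.log(max(probs[label], 1e-12)))
    return sum(losses) / len(losses)

def fit_temperature(logits_batch: list[list[float]], labels: list[int],
                    candidates=None) -> float:
    """Pick the temperature that minimizes validation NLL (simple grid search)."""
    if candidates is None:
        candidates = [0.5 + 0.1 * i for i in range(46)]  # 0.5 .. 5.0
    return min(candidates, key=lambda t: nll(logits_batch, labels, t))

if __name__ == "__main__":
    # Overconfident toy model: large logit margins but only 75% accuracy,
    # so the fitted temperature comes out well above 1, softening probabilities.
    logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
    labels = [0, 0, 0, 1]
    t = fit_temperature(logits, labels)
    print(t, softmax([4.0, 0.0], t))
```

In practice the temperature is fit on a held-out validation set and then applied by dividing logits at inference time, leaving accuracy unchanged while adjusting reported confidence.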
Logistics
- Locations: New York City, NY; San Francisco, CA (team preference for New York)
- Location-based hybrid policy: staff expected to be in an office at least 25% of the time
- Education: at least a Bachelor's degree in a related field or equivalent experience (MS/PhD preferred)
- Visa sponsorship: Anthropic will make reasonable efforts to sponsor visas for candidates where possible
Benefits
- Competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and offices in which to collaborate
How we work
Anthropic pursues large-scale empirical research with strong collaboration and frequent research discussions. The team values clear communication and impact-focused research that advances steerable, trustworthy AI. Join to help ensure that advanced AI systems behave reliably, ethically, and in alignment with human values.