Machine Learning Engineer, Safeguards Research

at Anthropic

📍 San Francisco, United States

USD 315,000–340,000 per year

Mid-level

Hybrid

Required Skills & Competences

Python, Machine Learning, TensorFlow, PyTorch

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems that are safe and beneficial for users and society. The Safeguards Research Team focuses on critical safety research and engineering to ensure AI systems can be deployed safely, enabling responsible scaling of state-of-the-art models.

Responsibilities

Design and implement ML pipelines for training and evaluating safety classifiers and detection models
Develop systems to fine-tune language models for specific safety evaluation tasks
Build infrastructure for hyperparameter optimization and model selection for safety experiments
Create efficient data processing pipelines to manage large-scale model outputs and training datasets
Develop tooling to automate generation, analysis, and classification of jailbreak attempts
Build evaluation frameworks to systematically test model behaviors across safety dimensions
Create flexible interfaces for researchers to experiment with varying model architectures and training configurations

Requirements

Hands-on experience training and fine-tuning basic ML models
Understanding of fundamental ML concepts such as overfitting and regularization
Practical experience improving and evaluating ML models
Proficiency with ML frameworks like PyTorch, TensorFlow, or JAX, including custom training loops
Strong software engineering skills, especially in Python
Experience building scalable data pipelines and ML infrastructure
Familiarity with prompting and working with large language models
Preference for simple, reliable ML engineering solutions
Comfortable working in a fast-paced, collaborative research environment
Commitment to understanding and addressing AI impacts

Strong candidates may also have:

Implemented custom loss functions and evaluation metrics
Experience with experiment and evaluation tracking tools
Built integrated training, evaluation, and deployment pipelines
Contributed to open-source machine learning or AI safety tools

Benefits

Competitive compensation, including an annual salary range of $315,000–$340,000 USD
Hybrid work policy with an expected office presence of at least 25% of the time
Visa sponsorship support with immigration lawyer assistance
Generous vacation and parental leave
Flexible working hours
Collaborative and mission-driven team environment with frequent research discussions

Education

At least a Bachelor's degree in a related field or equivalent experience

Location

San Francisco, CA

About Anthropic

Anthropic is a public benefit corporation focused on building trustworthy AI systems, fostering diversity and inclusion, and placing high-impact work and ethical considerations at the core of its AI research.