Vacancy is archived. Applications are no longer accepted.
Research Engineer, Machine Learning Infrastructure (Pre-Training)
at Anthropic
San Francisco, United States
Used Tools & Technologies
Not specified
Required Skills & Competences
- Python @ 3
- Algorithms @ 3
- Distributed Systems @ 6
- MLOps @ 3
- Communication @ 6
- Performance Optimization @ 6
- Debugging @ 3
- Experimentation @ 3
- PyTorch @ 3
- Cloud Computing @ 6
Details
Anthropic is building reliable, interpretable, and steerable AI systems and is seeking a Research Engineer to join the Pretraining team working on next-generation large language model training infrastructure. This role sits at the intersection of research and engineering, contributing to safe, scalable, and reproducible model training.
Responsibilities
- Design and implement high-performance ML training infrastructure for large language model research
- Develop and maintain core ML framework primitives in JAX, PyTorch, etc.
- Create robust automated evaluation and benchmarking systems for model performance
- Implement comprehensive monitoring and debugging tools for ML workflows
- Design and optimize data loading pipelines to maximize training throughput
- Build MLOps tooling to support reproducible research and experimentation
- Collaborate with research teams to prototype and scale novel training architectures
- Develop infrastructure for efficient hyperparameter sweeps and architecture search
Requirements
- Strong software engineering skills with experience building distributed systems
- Expertise in Python and experience with distributed computing frameworks
- Deep understanding of cloud computing platforms and distributed systems architecture
- Experience with high-throughput, fault-tolerant system design
- Strong background in performance optimization and system scaling
- Excellent problem-solving skills and attention to detail
- Strong communication skills and ability to work collaboratively
- Education: at least a Bachelor's degree in a related field or equivalent experience
Preferred / Strong Candidates May Have
- Advanced degree (MS or PhD) in Computer Science or related field
- Experience with language model training infrastructure
- Strong background in distributed systems and parallel computing
- Expertise in tokenization algorithms and techniques
- Experience building high-throughput, fault-tolerant systems
- Deep knowledge of monitoring and observability practices
- Experience with infrastructure-as-code and configuration management
- Background in MLOps or ML infrastructure
Logistics & Other Details
- Location: San Francisco, CA (Anthropic headquarters)
- Office policy: Location-based hybrid; staff are expected to be in an office at least ~25% of the time (some roles may require more office time)
- Visa sponsorship: Anthropic sponsors visas and will make reasonable efforts to secure one if an offer is made
- Encouragement to apply even if you do not meet every qualification
Compensation
- Annual Salary: $315,000 - $340,000 USD
Benefits
- Competitive compensation and benefits
- Optional equity donation matching
- Generous vacation and parental leave
- Flexible working hours and a collaborative office environment