Vacancy is archived. Applications are no longer accepted.
Research Engineer, Machine Learning Infrastructure (Pre-Training)
at Anthropic
San Francisco, United States
Used Tools & Technologies
Not specified
Required Skills & Competences
- Python @ 3
- Algorithms @ 3
- Distributed Systems @ 6
- MLOps @ 3
- Communication @ 6
- Performance Optimization @ 6
- Debugging @ 3
- Experimentation @ 3
- PyTorch @ 3
- Cloud Computing @ 6
Details
Anthropic is building reliable, interpretable, and steerable AI systems and is seeking a Research Engineer to join the Pretraining team working on next-generation large language model training infrastructure. This role sits at the intersection of research and engineering, contributing to safe, scalable, and reproducible model training.
Responsibilities
- Design and implement high-performance ML training infrastructure for large language model research
- Develop and maintain core ML framework primitives in JAX, PyTorch, etc.
- Create robust automated evaluation and benchmarking systems for model performance
- Implement comprehensive monitoring and debugging tools for ML workflows
- Design and optimize data loading pipelines to maximize training throughput
- Build MLOps tooling to support reproducible research and experimentation
- Collaborate with research teams to prototype and scale novel training architectures
- Develop infrastructure for efficient hyperparameter sweeps and architecture search
Requirements
- Strong software engineering skills with experience building distributed systems
- Expertise in Python and experience with distributed computing frameworks
- Deep understanding of cloud computing platforms and distributed systems architecture
- Experience with high-throughput, fault-tolerant system design
- Strong background in performance optimization and system scaling
- Excellent problem-solving skills and attention to detail
- Strong communication skills and ability to work collaboratively
- Education: at least a Bachelor's degree in a related field or equivalent experience
Preferred / Strong Candidates May Have
- Advanced degree (MS or PhD) in Computer Science or related field
- Experience with language model training infrastructure
- Strong background in distributed systems and parallel computing
- Expertise in tokenization algorithms and techniques
- Experience building high-throughput, fault-tolerant systems
- Deep knowledge of monitoring and observability practices
- Experience with infrastructure-as-code and configuration management
- Background in MLOps or ML infrastructure
Logistics & Other Details
- Location: San Francisco, CA (Anthropic headquarters)
- Office policy: Location-based hybrid; staff are expected to be in an office at least ~25% of the time (some roles may require more office time)
- Visa sponsorship: Anthropic sponsors visas and will make reasonable efforts to secure one if an offer is made
- Encouragement to apply even if you do not meet every qualification
Compensation
- Annual Salary: $315,000 - $340,000 USD
Benefits
- Competitive compensation and benefits
- Optional equity donation matching
- Generous vacation and parental leave
- Flexible working hours and a collaborative office environment