Vacancy is archived. Applications are no longer accepted.

Research Engineer, Machine Learning Infrastructure (Pre-Training)

Seniority: Middle
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

  • Python @ 3
  • Algorithms @ 3
  • Distributed Systems @ 6
  • MLOps @ 3
  • Communication @ 6
  • Performance Optimization @ 6
  • Debugging @ 3
  • Experimentation @ 3
  • PyTorch @ 3
  • Cloud Computing @ 6

Details

Anthropic is building reliable, interpretable, and steerable AI systems and is seeking a Research Engineer to join the Pretraining team, working on training infrastructure for next-generation large language models. The role sits at the intersection of research and engineering, contributing to safe, scalable, and reproducible model training.

Responsibilities

  • Design and implement high-performance ML training infrastructure for large language model research
  • Develop and maintain core ML framework primitives in JAX, PyTorch, etc.
  • Create robust automated evaluation and benchmarking systems for model performance
  • Implement comprehensive monitoring and debugging tools for ML workflows
  • Design and optimize data loading pipelines to maximize training throughput
  • Build MLOps tooling to support reproducible research and experimentation
  • Collaborate with research teams to prototype and scale novel training architectures
  • Develop infrastructure for efficient hyperparameter sweeps and architecture search

Requirements

  • Strong software engineering skills with experience building distributed systems
  • Expertise in Python and experience with distributed computing frameworks
  • Deep understanding of cloud computing platforms and distributed systems architecture
  • Experience with high-throughput, fault-tolerant system design
  • Strong background in performance optimization and system scaling
  • Excellent problem-solving skills and attention to detail
  • Strong communication skills and ability to work collaboratively
  • Education: at least a Bachelor's degree in a related field or equivalent experience

Preferred / Strong Candidates May Have

  • Advanced degree (MS or PhD) in Computer Science or related field
  • Experience with language model training infrastructure
  • Strong background in distributed systems and parallel computing
  • Expertise in tokenization algorithms and techniques
  • Experience building high-throughput, fault-tolerant systems
  • Deep knowledge of monitoring and observability practices
  • Experience with infrastructure-as-code and configuration management
  • Background in MLOps or ML infrastructure

Logistics & Other Details

  • Location: San Francisco, CA (Anthropic headquarters)
  • Office policy: Location-based hybrid — staff expected to be in an office at least ~25% of the time (roles may require more office time)
  • Visa sponsorship: Anthropic sponsors visas and will make reasonable efforts to obtain one if an offer is made
  • Candidates are encouraged to apply even if they do not meet every qualification

Compensation

  • Annual Salary: $315,000 - $340,000 USD

Benefits

  • Competitive compensation and benefits
  • Optional equity donation matching
  • Generous vacation and parental leave
  • Flexible working hours and a collaborative office environment
  • Guidance on candidate AI usage during the application process