Research Engineer, Machine Learning Infrastructure (Pre-training)
Required Skills & Competences
- Python @ 3
- Algorithms @ 3
- Distributed Systems @ 6
- MLOps @ 3
- Communication @ 6
- Performance Optimization @ 6
- Debugging @ 3
- Experimentation @ 3
- PyTorch @ 3
- Cloud Computing @ 6
Details
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems that are safe and beneficial for users and society. The team consists of researchers, engineers, policy experts, and business leaders focused on building beneficial AI systems. Anthropic is at the forefront of AI research, working to ensure that transformative AI systems are aligned with human interests.
Responsibilities
- Design and implement high-performance ML training infrastructure for large language model research
- Develop and maintain core ML framework primitives in JAX, PyTorch, etc.
- Create robust automated evaluation and benchmarking systems for model performance
- Implement comprehensive monitoring and debugging tools for ML workflows
- Design and optimize data loading pipelines to maximize training throughput
- Build MLOps tooling to support reproducible research and experimentation
- Collaborate with research teams to prototype and scale novel training architectures
- Develop infrastructure for efficient hyperparameter sweeps and architecture search
Requirements
- Strong software engineering skills with experience in building distributed systems
- Expertise in Python and experience with distributed computing frameworks
- Deep understanding of cloud computing platforms and distributed systems architecture
- Experience with high-throughput, fault-tolerant system design
- Strong background in performance optimization and system scaling
- Excellent problem-solving skills and attention to detail
- Strong communication skills and ability to work collaboratively
Preferred Qualifications
- Advanced degree (MS or PhD) in Computer Science or related field
- Experience with language model training infrastructure
- Strong background in distributed systems and parallel computing
- Expertise in tokenization algorithms and techniques
- Experience building high-throughput, fault-tolerant systems
- Deep knowledge of monitoring and observability practices
- Experience with infrastructure-as-code and configuration management
- Background in MLOps or ML infrastructure
Benefits and Work Environment
Anthropic is a public benefit corporation headquartered in San Francisco. It offers competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a collaborative office space. The team values impact, collaboration, and communication, and focuses on large-scale, high-impact AI research projects. Under a location-based hybrid policy, staff are currently expected to be in the office at least 25% of the time. Visa sponsorship is available.
Salary
The annual salary range for this position is $315,000 - $340,000 USD.