Used Tools & Technologies
Not specified

Required Skills & Competences
- Algorithms @ 3
- Distributed Systems @ 3
- Machine Learning @ 3
- Debugging @ 3
- GPU @ 3

Details
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems that are safe and beneficial for users and society. The Scaling Team builds the infrastructure that powers Anthropic's largest scale pre-training runs, operating at the intersection of research, performance, and distributed systems. This role focuses on identifying systems problems when running ML at scale and developing solutions to optimize throughput and robustness of large distributed ML systems.
Responsibilities
- Develop and optimize the pretraining pipeline and unified infrastructure to maximize efficiency across different computing architectures.
- Identify and solve large-scale systems problems affecting ML workloads and implement systems that improve throughput and robustness.
- Implement low-latency, high-throughput sampling for large language models.
- Implement GPU kernels and adapt models to low-precision inference.
- Design and implement custom load-balancing algorithms to optimize serving efficiency.
- Build quantitative models of system performance and debug kernel-level network latency spikes in containerized environments.
- Design and implement fault-tolerant distributed systems with complex network topologies.
- Collaborate closely with researchers and engineers (pair programming is encouraged) and contribute to both product deployment and long-term research initiatives.
Requirements
- Significant software engineering or machine learning experience, particularly with supercomputers or other large-scale systems.
- Bachelor's degree in a related field or equivalent experience (required).
- Results-oriented, flexible, and impact-focused approach to work.
- Willingness to take responsibility beyond narrow job boundaries and to pair-program collaboratively.
- Interest in learning more about machine learning research and concern for the societal impacts of AI.
Strong candidates may also have experience with:
- High-performance, large-scale ML systems and high-performance computing / supercomputing.
- GPU / accelerator programming and implementing GPU kernels.
- ML framework internals and OS internals.
- Language modeling with transformers and low-precision inference techniques.
- Debugging kernel-level network latency in containerized environments and designing fault-tolerant distributed systems.
Benefits
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours and a pleasant office space for collaboration.
- Visa sponsorship is available in many cases (Anthropic retains an immigration lawyer and makes reasonable efforts to sponsor when an offer is made).
Logistics
- Location: London, UK. Location-based hybrid policy: staff are expected to be in an office at least 25% of the time (some roles may require more time in office).
- Application deadline: None (applications reviewed on a rolling basis).
- Guidance on candidates' use of AI during the application process is provided to applicants.