Used Tools & Technologies
Machine Learning, LLM, GPU
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Distributed Systems @ 6
Hiring @ 3
AWS @ 3
Communication @ 6
Networking @ 3
SRE @ 3
API @ 3
Observability @ 3
AI @ 3
InfiniBand @ 3
Details
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The AI Reliability Engineering (AIRE) team partners with teams across Anthropic to improve reliability across the most critical serving paths — from the SDK through the network, API layers, serving infrastructure, and accelerators. The team focuses on building robust, resilient systems for Claude, working across teams during incidents and on cross-cutting projects to improve system reliability.
Responsibilities
- Develop appropriate Service Level Objectives (SLOs) for large language model serving systems, balancing availability and latency with development velocity
- Design and implement monitoring and observability systems across the token path
- Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers
- Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements
- Support the reliability of safeguard model serving, critical for site reliability and Anthropic's safety commitments
Requirements
- Strong distributed systems, infrastructure, or reliability background — looking for reliability-minded software engineers and SREs
- Comfortable jumping into unfamiliar systems during incidents and driving resolution
- Ability to think holistically about how systems compose and where the seams are
- Strong cross-team collaboration and communication skills; able to build lasting relationships across teams
- Ownership mindset for outcomes, including systems you do not directly own
- Bachelor’s degree in a related field or equivalent experience (minimum requirement)
Preferred / Nice to Have
- Experience as an SRE, Production Engineer, or similar reliability-focused role on large-scale systems
- Experience operating large-scale model serving or training infrastructure (>1000 GPUs)
- Experience with ML hardware accelerators (GPUs, TPUs, AWS Trainium)
- Understanding of ML-specific networking optimizations such as RDMA and InfiniBand
- Expertise in AI-specific observability tools and frameworks
- Experience with chaos engineering and systematic resilience testing
- Contributions to open-source infrastructure or ML tooling
Compensation
- Annual Salary: $325,000 - $485,000 USD
Logistics
- Locations: San Francisco, CA; New York City, NY; Seattle, WA
- Location-based hybrid policy: staff are expected to be in one of Anthropic's offices at least 25% of the time
- Visa sponsorship: Anthropic does sponsor visas and retains an immigration lawyer; availability may vary by role and candidate
- Education requirement: at least a Bachelor’s degree in a related field or equivalent experience
Benefits
- Competitive compensation and benefits
- Optional equity donation matching
- Generous vacation and parental leave
- Flexible working hours
- Office space for collaboration
How we work / Culture
- Collaborative teams working on a few large-scale research efforts
- Emphasis on communication and cross-disciplinary collaboration
- Encouragement for applicants from diverse backgrounds; Anthropic values inclusive hiring practices