Staff / Senior Software Engineer, AI Reliability

at Anthropic

📍 New York City, United States
📍 San Francisco, United States
📍 Seattle, United States

USD 325,000-485,000 per year

MIDDLE SENIOR

✅ Hybrid

✅ Visa Sponsorship

Used Tools & Technologies

Machine Learning, LLM, GPU

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels are:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others, and you know the pitfalls and tricks.

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Distributed Systems @ 6, Hiring @ 3, AWS @ 3, Communication @ 6, Networking @ 3, SRE @ 3, API @ 3, Observability @ 3, AI @ 3, InfiniBand @ 3

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The AI Reliability Engineering (AIRE) team partners with teams across Anthropic to improve reliability across the most critical serving paths — from the SDK through the network, API layers, serving infrastructure, and accelerators. The team focuses on building robust, resilient systems for Claude, working across teams during incidents and on cross-cutting projects to improve system reliability.

Responsibilities

Develop appropriate Service Level Objectives (SLOs) for large language model serving systems, balancing availability and latency with development velocity
Design and implement monitoring and observability systems across the token path
Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers
Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements
Support the reliability of safeguard model serving, which is critical to both site reliability and Anthropic's safety commitments

Requirements

Strong background in distributed systems, infrastructure, or reliability; the team is looking for reliability-minded software engineers and SREs
Comfortable jumping into unfamiliar systems during incidents and driving resolution
Ability to think holistically about how systems compose and where the seams between them are
Strong cross-team collaboration and communication skills; able to build lasting relationships across teams
Ownership mindset for outcomes, including systems you do not directly own
Bachelor’s degree in a related field or equivalent experience (minimum requirement)

Preferred / Nice to Have

Experience as an SRE, Production Engineer, or similar reliability-focused role on large-scale systems
Experience operating large-scale model serving or training infrastructure (more than 1,000 GPUs)
Experience with ML hardware accelerators (GPUs, TPUs, AWS Trainium)
Understanding of ML-specific networking optimizations such as RDMA and InfiniBand
Expertise in AI-specific observability tools and frameworks
Experience with chaos engineering and systematic resilience testing
Contributions to open-source infrastructure or ML tooling

Compensation

Annual Salary: $325,000 - $485,000 USD

Logistics

Locations: San Francisco, CA; New York City, NY; Seattle, WA
Location-based hybrid policy: staff are expected to be in one of Anthropic's offices at least 25% of the time
Visa sponsorship: Anthropic sponsors visas and retains an immigration lawyer; availability may vary by role and candidate

Benefits

Competitive compensation and benefits
Optional equity donation matching
Generous vacation and parental leave
Flexible working hours
Office space for collaboration

How we work / Culture

Collaborative teams working on a few large-scale research efforts
Emphasis on communication and cross-disciplinary collaboration
Applicants from diverse backgrounds are encouraged to apply; Anthropic values inclusive hiring practices