Staff / Senior Software Engineer, AI Reliability

USD 325,000-485,000 per year
Seniority: Mid-Senior
Hybrid
Visa Sponsorship

Used Tools & Technologies

Machine Learning, LLM, GPU

Required Skills & Competences

Distributed Systems @ 6, Hiring @ 3, AWS @ 3, Communication @ 6, Networking @ 3, SRE @ 3, API @ 3, Observability @ 3, AI @ 3, InfiniBand @ 3

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The AI Reliability Engineering (AIRE) team partners with teams across Anthropic to improve reliability along the most critical serving paths, from the SDK through the network, API layers, serving infrastructure, and accelerators. The team builds robust, resilient systems for Claude, working with other teams during incidents and on cross-cutting reliability projects.

Responsibilities

  • Develop appropriate Service Level Objectives (SLOs) for large language model serving systems, balancing availability and latency with development velocity
  • Design and implement monitoring and observability systems across the token path
  • Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers
  • Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements
  • Support the reliability of safeguard model serving, which is critical to both site reliability and Anthropic's safety commitments

Requirements

  • Strong background in distributed systems, infrastructure, or reliability; the team is looking for reliability-minded software engineers and SREs
  • Comfortable jumping into unfamiliar systems during incidents and driving resolution
  • Ability to think holistically about how systems compose and where the seams between them are
  • Strong cross-team collaboration and communication skills; able to build lasting relationships across teams
  • Ownership mindset for outcomes, including systems you do not directly own
  • Bachelor’s degree in a related field or equivalent experience (minimum requirement)

Preferred / Nice to Have

  • Experience as an SRE, Production Engineer, or similar reliability-focused role on large-scale systems
  • Experience operating large-scale model serving or training infrastructure (more than 1,000 GPUs)
  • Experience with ML hardware accelerators (GPUs, TPUs, AWS Trainium)
  • Understanding of ML-specific networking optimizations such as RDMA and InfiniBand
  • Expertise in AI-specific observability tools and frameworks
  • Experience with chaos engineering and systematic resilience testing
  • Contributions to open-source infrastructure or ML tooling

Compensation

  • Annual Salary: $325,000 - $485,000 USD

Logistics

  • Locations: San Francisco, CA; New York City, NY; Seattle, WA
  • Location-based hybrid policy: staff are expected to be in one of Anthropic's offices at least 25% of the time
  • Visa sponsorship: Anthropic sponsors visas and retains an immigration lawyer to assist; availability may vary by role and candidate

Benefits

  • Competitive compensation and benefits
  • Optional equity donation matching
  • Generous vacation and parental leave
  • Flexible working hours
  • Office space for collaboration

How we work / Culture

  • Collaborative teams working on a few large-scale research efforts
  • Emphasis on communication and cross-disciplinary collaboration
  • Applicants from diverse backgrounds are encouraged to apply; Anthropic values inclusive hiring practices