Staff Software Engineer, AI Reliability Engineering

GBP 325,000-390,000 per year
SENIOR
✅ Hybrid
✅ Visa Sponsorship

Used Tools & Technologies

Machine Learning, LLM, GPU

Required Skills & Competences

Distributed Systems @ 7, Communication @ 7, Networking @ 4, SRE @ 4, API @ 4, Observability @ 4, AI @ 4, InfiniBand @ 4

Details

About Anthropic

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The team is a quickly growing group of researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the Role

AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across the token path: SDKs, network, API layers, serving infrastructure, and accelerators. The team focuses on improving robustness and resilience for Claude’s serving systems, working across teams to design, implement, and operate high-availability model serving infrastructure and observability.

Responsibilities

  • Develop appropriate Service Level Objectives (SLOs) for large language model serving systems, balancing availability and latency with development velocity.
  • Design and implement monitoring and observability systems across the token path.
  • Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
  • Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
  • Support the reliability of safeguard model serving, critical for both site reliability and Anthropic's safety commitments.

Requirements

  • Strong background in distributed systems, infrastructure, or reliability (the team is looking for both reliability-minded software engineers and SREs).
  • Comfortable jumping into unfamiliar systems during incidents and driving resolution.
  • Ability to think holistically about system composition and seams between components.
  • Strong communication and collaboration skills; ability to build relationships across teams.
  • Ownership mindset for user-facing outcomes, even for systems you don't directly own.
  • Education: at least a Bachelor's degree in a related field or equivalent experience.

Strong candidates may also have:

  • Experience as an SRE, Production Engineer, or in a similar reliability-focused role on large-scale systems.
  • Experience operating large-scale model serving or training infrastructure (>1000 GPUs).
  • Experience with ML hardware accelerators (GPUs, TPUs, Trainium).
  • Understanding of ML-specific networking optimizations like RDMA and InfiniBand.
  • Expertise in AI-specific observability tools and frameworks.
  • Experience with chaos engineering and systematic resilience testing.
  • Contributions to open-source infrastructure or ML tooling.

Logistics

  • Location: London, UK. Location-based hybrid policy: staff are expected to be in one of Anthropic's offices at least 25% of the time (some roles may require more time in offices).
  • Visa sponsorship: Anthropic states that it sponsors visas, although it cannot guarantee sponsorship for every role and candidate. When it makes an offer, it will make reasonable efforts to obtain a visa and retains an immigration lawyer to help.

Compensation

  • Annual Salary: £325,000 - £390,000

Benefits

  • Competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space for collaboration.

How we're different

Anthropic emphasizes large-scale, collaborative AI research focused on steerable, trustworthy systems, and it values communication and cross-team collaboration. Candidates are encouraged to apply even if they do not meet every listed qualification.