Engineering Manager, AI Reliability Engineering

📍 Dublin, Ireland
EUR 295,000-355,000 per year
MIDDLE
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

Distributed Systems @ 6, Hiring @ 3, Leadership @ 6, Communication @ 6, Networking @ 3, LLM @ 3, GPU @ 3

Details

Anthropic is hiring an experienced engineering leader to manage the Reliability Engineering team focused on defining and achieving reliability metrics for Anthropic's internal and external products and services. This role leads Software and Systems Engineers working on large language model serving and training systems, and advances reliability practices using modern AI capabilities.

Responsibilities

  • Lead and grow a team of reliability engineers responsible for LLM serving and training systems
  • Develop and drive adoption of service level objectives (SLOs) that balance availability/latency with development velocity
  • Design and implement comprehensive monitoring and observability systems for availability, latency, and other critical metrics
  • Architect and oversee high‑availability model serving infrastructure that supports millions of external customers and high‑traffic internal workloads
  • Lead strategy for automated failover and recovery across multiple regions and cloud providers
  • Establish and manage incident response processes for critical AI services; drive rapid recovery and systematic post‑incident improvements
  • Direct cost optimization initiatives for large‑scale AI infrastructure with focus on accelerator (GPU/TPU/Trainium) utilization and efficiency
  • Partner with cross‑functional teams (ML engineers, infrastructure teams, product) to align reliability efforts with company objectives
  • Build and sustain an engineering culture focused on reliability, operational excellence, and innovation

Requirements

  • Experience managing and scaling reliability or infrastructure engineering teams
  • Deep technical knowledge of distributed systems, observability, and monitoring at scale
  • Understanding of operating AI infrastructure and AI‑specific performance indicators
  • Track record of implementing SLO/SLA frameworks and driving organizational adoption
  • Experience bridging technical discussions between ML engineers and infrastructure teams
  • Strong leadership, communication, and influence skills across organizations
  • Demonstrated hiring and talent development capabilities

Strong candidates may also have:

  • Managed teams operating large‑scale model training or serving infrastructure (>1000 GPUs)
  • Hands‑on experience with ML hardware accelerators (GPUs, TPUs, Trainium)
  • Understanding of ML‑specific networking optimizations and their operational implications
  • Led major reliability transformations or infrastructure migrations
  • Built reliability engineering practices from the ground up or contributed to open‑source ML/infrastructure tooling

Logistics

  • Education requirement: Bachelor's degree in a related field or equivalent experience
  • Location & office policy: Dublin, Ireland — location‑based hybrid policy (staff expected to be in offices at least ~25% of the time)
  • Visa sponsorship: Anthropic sponsors visas when feasible and retains immigration legal support
  • Guidance: Applicants are encouraged to apply even if they do not meet every qualification

Compensation & Benefits

  • Annual Salary: €295,000 - €355,000 (EUR)
  • Anthropic offers competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space for collaboration.

About Anthropic

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems that are safe and beneficial. The company emphasizes collaborative, high‑impact research and values strong communication across teams.