Engineering Manager, AI Reliability Engineering

at Anthropic

📍 Dublin, Ireland

EUR 295,000-355,000 per year

✅ Hybrid

Required Skills & Competences

Distributed Systems @ 6
Hiring @ 3
Leadership @ 6
Communication @ 6
Networking @ 3
LLM @ 3
GPU @ 3

Details

Anthropic is hiring an experienced engineering leader to manage the Reliability Engineering team, which defines and achieves reliability metrics for Anthropic's internal and external products and services. The role leads Software and Systems Engineers working on large language model serving and training systems, and advances reliability practices using modern AI capabilities.

Responsibilities

Lead and grow a team of reliability engineers responsible for LLM serving and training systems
Develop and drive adoption of service level objectives (SLOs) that balance availability/latency with development velocity
Design and implement comprehensive monitoring and observability systems for availability, latency, and other critical metrics
Architect and oversee high‑availability model serving infrastructure that supports millions of external customers and high‑traffic internal workloads
Lead strategy for automated failover and recovery across multiple regions and cloud providers
Establish and manage incident response processes for critical AI services; drive rapid recovery and systematic post‑incident improvements
Direct cost optimization initiatives for large‑scale AI infrastructure with focus on accelerator (GPU/TPU/Trainium) utilization and efficiency
Partner with cross‑functional teams (ML engineers, infrastructure teams, product) to align reliability efforts with company objectives
Build and sustain an engineering culture focused on reliability, operational excellence, and innovation

Requirements

Experience managing and scaling reliability or infrastructure engineering teams
Deep technical knowledge of distributed systems, observability, and monitoring at scale
Understanding of AI infrastructure operations and AI‑specific performance indicators
Track record of implementing SLO/SLA frameworks and driving organizational adoption
Experience bridging technical discussions between ML engineers and infrastructure teams
Strong leadership, communication, and influence skills across organizations
Demonstrated hiring and talent development capabilities

Strong candidates may also have:

Managed teams operating large‑scale model training or serving infrastructure (>1000 GPUs)
Hands‑on experience with ML hardware accelerators (GPUs, TPUs, Trainium)
Understanding of ML‑specific networking optimizations and their operational implications
Led major reliability transformations or infrastructure migrations
Built reliability engineering practices from the ground up or contributed to open‑source ML/infrastructure tooling

Logistics

Education requirement: Bachelor's degree in a related field or equivalent experience
Location & office policy: Dublin, Ireland; location‑based hybrid policy (staff are expected to be in the office at least 25% of the time)
Visa sponsorship: Anthropic sponsors visas where feasible and retains immigration legal support
Guidance: Applicants are encouraged to apply even if they do not meet every qualification

Compensation & Benefits

Annual Salary: €295,000 - €355,000 (EUR)
Anthropic offers competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space for collaboration.

About Anthropic

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems that are safe and beneficial. The company emphasizes collaborative, high‑impact research and values strong communication across teams.