Engineering Manager - AI Reliability

USD 405,000-485,000 per year
Hybrid

Required Skills & Competences

Distributed Systems @ 6, Hiring @ 3, Leadership @ 3, Communication @ 3, Networking @ 3, GPU @ 3

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The Engineering Manager - AI Reliability will manage a Reliability Engineering team (Software Engineers and Systems Engineers) focused on defining and achieving reliability metrics for Anthropic's critical serving systems. This leader will drive improvements in reliability for large language model serving and pioneer the use of modern AI capabilities to reengineer reliability engineering practices.

Responsibilities

  • Lead and grow a team of reliability engineers responsible for large language model serving.
  • Drive the development and adoption of Service Level Objectives (SLOs) that balance availability/latency with development velocity across the organization.
  • Oversee design and implementation of comprehensive monitoring and observability systems for availability, latency, and other critical metrics.
  • Guide architecture of high-availability language model serving infrastructure capable of supporting millions of external customers and high-traffic internal workloads.
  • Lead strategy for automated failover and recovery systems across multiple regions and cloud providers.
  • Establish and manage incident response processes for critical AI services, ensuring rapid recovery and systematic improvements.
  • Direct cost optimization initiatives for large-scale AI infrastructure, with a focus on accelerator (GPU/TPU/Trainium) utilization and efficiency.
  • Partner with cross-functional teams (ML engineers, infrastructure teams, product) to align reliability efforts with company objectives.
  • Build and maintain an engineering culture focused on reliability, operational excellence, and innovation.

Requirements

  • Experience managing and scaling reliability or infrastructure engineering teams.
  • Deep technical knowledge of distributed systems, observability, and monitoring at scale.
  • Experience operating AI infrastructure and guiding technical decisions specific to ML workloads.
  • Proven experience implementing SLO/SLA frameworks and driving organization-wide adoption.
  • Familiarity with traditional infrastructure metrics and AI-specific performance indicators.
  • Ability to lead technical discussions and translate requirements between ML engineers and infrastructure teams.
  • Excellent leadership, communication, hiring, and talent development skills.
  • Bachelor’s degree in a related field or equivalent experience (required).

Strong candidates may also

  • Have managed teams operating large-scale model training or serving infrastructure (>1000 GPUs).
  • Bring hands-on experience with ML hardware accelerators (GPUs, TPUs, Trainium).
  • Understand ML-specific networking optimizations and their operational implications.
  • Have led major reliability transformations or infrastructure migrations.
  • Possess experience building reliability engineering practices from the ground up.
  • Have contributed to or led open-source infrastructure or ML tooling initiatives.
  • Demonstrate thought leadership in the reliability engineering community.

Logistics

  • Location: San Francisco, CA. Under the location-based hybrid policy, staff are expected to be in an office at least 25% of the time (some roles may require more time on-site).
  • Education requirements: at least a Bachelor’s degree in a related field or equivalent experience.
  • Visa sponsorship: Anthropic sponsors visas where feasible and retains immigration counsel to assist successful candidates.

Compensation

  • Annual salary range: $405,000 - $485,000 USD.

Benefits / Company highlights

  • Competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a collaborative office in San Francisco.
  • Anthropic is a public benefit corporation focused on high-impact AI research; it values communication, collaboration, and diverse perspectives.

How we're different

  • Anthropic focuses on large-scale AI research efforts, values impact, and treats AI research as an empirical science. Frequent research discussions and collaboration are core to the culture.