Staff Software Engineer, AI Reliability Engineering

📍 Dublin, Ireland
EUR 235,000-355,000 per year
SENIOR
✅ Hybrid

Required Skills & Competences

Distributed Systems @ 4, Hiring @ 4, Communication @ 4, Networking @ 4, GPU @ 4

Details

Anthropic is hiring experienced reliability engineers (software and systems engineers) to define and achieve reliability metrics for internal and external AI products and services. The team focuses on improving reliability for model serving and training systems, using modern AI capabilities to reengineer operations and ensure safe, reliable AI at scale.

Responsibilities

  • Develop appropriate Service Level Objectives (SLOs) for large language model serving and training systems, balancing availability/latency with development velocity.
  • Design and implement monitoring systems including availability, latency, and other salient metrics.
  • Assist in the design and implementation of high-availability language model serving infrastructure capable of handling millions of external customers and high-traffic internal workloads.
  • Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.
  • Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident.
  • Build and maintain cost optimization systems for large-scale AI infrastructure, with a focus on accelerator (GPU/TPU/Trainium) utilization and efficiency.

Requirements

  • Extensive experience with distributed systems observability and monitoring at scale.
  • Understanding of the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines.
  • Proven experience implementing and maintaining SLO/SLA frameworks for business-critical services.
  • Comfort working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence).
  • Experience with chaos engineering and systematic resilience testing.
  • Ability to bridge the gap between ML engineers and infrastructure teams.
  • Excellent communication skills.
  • Bachelor’s degree in a related field or equivalent experience (required).

Strong candidates may also have

  • Experience operating large-scale model training or serving infrastructure (>1000 GPUs).
  • Experience with ML hardware accelerators (GPUs, TPUs, Trainium).
  • Understanding of ML-specific networking optimizations such as RDMA and InfiniBand.
  • Expertise in AI-specific observability tools and frameworks.
  • Knowledge of ML model deployment strategies and their reliability implications.
  • Contributions to open-source infrastructure or ML tooling.

Logistics

  • Education: At least a Bachelor's degree in a related field or equivalent experience.
  • Location-based hybrid policy: staff are expected to be in one of Anthropic's offices at least 25% of the time; some roles may require more office time.
  • Visa sponsorship: Anthropic sponsors visas and, once it makes an offer, will make reasonable efforts to assist with immigration.
  • Deadline to apply: None (applications reviewed on a rolling basis).

How we're different

Anthropic focuses on large-scale, high-impact AI research and values collaboration, communication, and research-driven empirical work. The company emphasizes building steerable, trustworthy AI and hosts frequent research discussions.

Benefits

Anthropic offers competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office spaces for collaboration. The company also provides guidance on candidates' use of AI in the application process.