Used Tools & Technologies
Machine Learning, LLM, GPU
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and you know the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Distributed Systems @ 7
Communication @ 4
Networking @ 4
SRE @ 7
API @ 4
Observability @ 4
AI @ 4
InfiniBand @ 4
Details
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The AI Reliability Engineering (AIRE) team partners with teams across Anthropic to improve reliability across critical serving paths — from the SDK through network, API layers, serving infrastructure, and accelerators. AIRE works alongside partner teams to make systems that deliver Claude more robust and resilient, both during incidents and through proactive projects.
Responsibilities
- Develop appropriate Service Level Objectives (SLOs) for large language model serving systems, balancing availability and latency with development velocity.
- Design and implement monitoring and observability systems across the token path.
- Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
- Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
- Support the reliability of safeguard model serving — critical for both site reliability and Anthropic's safety commitments.
Requirements / Qualifications
- Strong background in distributed systems, infrastructure, or reliability engineering; experience as a reliability-minded software engineer or SRE.
- Comfortable troubleshooting unfamiliar systems during incidents and driving resolution without deep prior expertise in every subsystem.
- Holistic systems thinking about how components compose and where seams exist.
- Ability to build lasting cross-team relationships and collaborate broadly across the company.
- Ownership mindset and care about user-facing outcomes, including for systems you don't directly own.
- Excellent communication and collaboration skills.
- Education: at least a Bachelor's degree in a related field or equivalent experience.
Strongly Preferred / Nice-to-Have
- Experience as an SRE, Production Engineer, or similar reliability-focused role on large-scale systems.
- Experience operating large-scale model serving or training infrastructure (e.g., >1000 GPUs).
- Experience with ML hardware accelerators (GPUs, TPUs, Trainium).
- Understanding of ML-specific networking optimizations such as RDMA and InfiniBand.
- Expertise in AI-specific observability tools and frameworks.
- Experience with chaos engineering and systematic resilience testing.
- Contributions to open-source infrastructure or ML tooling.
Logistics
- Location: Dublin, Ireland.
- Location-based hybrid policy: staff are expected to be in one of the company's offices at least 25% of the time.
- Visa sponsorship: Anthropic sponsors visas and will make reasonable efforts to obtain one for any candidate who receives an offer; the company retains an immigration lawyer to assist.
Compensation & Benefits
- Annual salary range: €235,000 to €295,000.
- Anthropic offers competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space for collaboration.
How We're Different
Anthropic emphasizes large-scale, collaborative AI research focused on steerable, trustworthy AI and values communication and cross-team collaboration. They encourage applications from candidates who may not meet every qualification and prioritize diverse perspectives in building AI systems with social and ethical considerations.