Staff Infrastructure Engineer, Discovery Team

USD 340,000-425,000 per year
SENIOR
Hybrid

Required Skills & Competences

  • Distributed Systems @ 7
  • Communication @ 7
  • Performance Optimization @ 7
  • Docker @ 4
  • Kubernetes @ 4
  • Spark @ 4
  • GCP @ 4
  • AWS @ 4
  • PyTorch @ 4
  • GPU @ 4

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems that are safe and beneficial. The Discovery Team is focused on building an "AI scientist" — systems capable of solving long-horizon reasoning challenges and enabling scientific workflows. This role works across the whole model stack and focuses on improving models' abilities to use computers as part of those workflows.

About the role

As a Staff Infrastructure Engineer on the Discovery Team, you will work end-to-end to identify and address infrastructure blockers on the path to scientific AGI. Strong candidates will have experience with performance optimization, distributed systems, VM/sandboxing/container deployment, and large-scale data pipelines. Familiarity with language model training, evaluation, and inference is a strong plus.

Responsibilities

  • Design and implement large-scale infrastructure systems to support AI scientist training, evaluation, and deployment across distributed environments.
  • Identify and resolve infrastructure bottlenecks impeding progress toward scientific capabilities.
  • Develop robust and reliable evaluation frameworks for measuring progress toward scientific AGI.
  • Build scalable and performant VM/sandboxing/container architectures to safely execute long-horizon AI tasks and scientific workflows (a sandboxing sketch follows this list).
  • Collaborate to translate experimental requirements into production-ready infrastructure.
  • Develop large-scale data pipelines to handle advanced language model training requirements.
  • Optimize large-scale training and inference pipelines for stable and efficient reinforcement learning.
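
To ground the sandboxing responsibility, here is a minimal sketch of resource-limited container execution using the Docker SDK for Python (docker-py). The image, command, limits, and timeout are illustrative assumptions, not Anthropic's actual configuration.

```python
# Minimal sandboxed-execution sketch with docker-py.
# All values below (image, limits, timeout) are illustrative assumptions.
import docker

client = docker.from_env()

def run_sandboxed(command: str, image: str = "python:3.11-slim") -> str:
    """Run an untrusted command in an isolated, resource-limited container."""
    container = client.containers.run(
        image,
        command,
        detach=True,
        network_disabled=True,    # no network access inside the sandbox
        mem_limit="512m",         # cap memory
        nano_cpus=1_000_000_000,  # cap at one CPU
        pids_limit=128,           # guard against fork bombs
        read_only=True,           # read-only root filesystem
    )
    try:
        container.wait(timeout=60)    # bound wall-clock time for the task
        return container.logs().decode()
    finally:
        container.remove(force=True)  # always clean up the sandbox
```

Long-horizon agent workloads would layer scheduling, snapshotting, and stronger isolation (e.g., VMs or gVisor) on top of a primitive like this.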

Requirements

  • 6+ years of highly relevant experience in infrastructure engineering with demonstrated expertise in large-scale distributed systems.
  • Deep knowledge of performance optimization techniques and system architectures for high-throughput ML workloads.
  • Experience with containerization technologies (Docker, Kubernetes) and orchestration at scale.
  • Proven track record of building large-scale data pipelines and distributed storage systems (a minimal pipeline sketch follows this list).
  • Experience diagnosing and resolving complex infrastructure challenges in production environments.
  • Ability to work across the full ML stack, from data pipelines to performance optimization, and to collaborate with researchers to scale experimental ideas.
  • Education: at least a Bachelor's degree in a related field or equivalent experience.
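
As an illustration of the data-pipeline requirement, below is a minimal batch pipeline sketch using Apache Beam's Python SDK. The bucket paths and the tokenize step are hypothetical placeholders; a production training-data pipeline would add filtering, deduplication, and sharded output at far larger scale.

```python
# Minimal batch-pipeline sketch with Apache Beam (Python SDK).
# Paths and the tokenize step are hypothetical placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def tokenize(line: str) -> list[str]:
    # Stand-in for real preprocessing (cleaning, dedup, tokenization).
    return line.lower().split()

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://example-bucket/raw/*.txt")
        | "Tokenize" >> beam.FlatMap(tokenize)
        | "Count" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.MapTuple(lambda token, n: f"{token}\t{n}")
        | "Write" >> beam.io.WriteToText("gs://example-bucket/counts/part")
    )
```

The same pipeline runs on a local runner for testing or on a distributed runner such as Dataflow for scale, which is the usual appeal of Beam's runner abstraction.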

Strong candidates may also have

  • Experience with language model training infrastructure and distributed ML frameworks (PyTorch, JAX); a minimal distributed-training sketch follows this list.
  • Background building infrastructure for AI research labs or large-scale ML organizations.
  • Knowledge of GPU/TPU architectures and language model inference optimization.
  • Experience with cloud platforms (AWS, GCP) at enterprise scale.
  • Familiarity with VM and container orchestration, workflow orchestration tools, and experiment management systems.
  • History of working with large-scale reinforcement learning and comfort with large-scale data pipelines (Beam, Spark, Dask, …).
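
For the distributed ML frameworks mentioned above, here is a minimal data-parallel training sketch using PyTorch's DistributedDataParallel, launched with torchrun. The model, data, and hyperparameters are toy stand-ins; real language-model training adds model sharding, checkpointing, and fault tolerance on top of this pattern.

```python
# Minimal DistributedDataParallel sketch; launch with e.g.
#   torchrun --nproc_per_node=8 train.py
# Model, data, and hyperparameters are toy stand-ins.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")        # one process per GPU
    rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(1024, 1024).cuda(rank), device_ids=[rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device=rank)
        loss = model(x).pow(2).mean()      # dummy objective
        opt.zero_grad()
        loss.backward()                    # gradients all-reduced across ranks
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```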

Logistics & Other details

  • Location: San Francisco, CA. Location-based hybrid policy: staff are expected to be in an office at least ~25% of the time (some roles may require more).
  • Visa sponsorship: Anthropic does sponsor visas and retains an immigration lawyer; sponsorship success varies by role and candidate.
  • The company encourages applicants from diverse and underrepresented backgrounds and welcomes candidates who may not meet every qualification.

Benefits

  • Competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space in San Francisco.

How we're different

Anthropic treats high-impact AI research as large-scale empirical science, values collaboration and communication, and focuses on a few large-scale research efforts with strong emphasis on steerable, trustworthy AI.