Staff Software Engineer, Kubernetes Platform

USD 320,000-405,000 per year
SENIOR
✅ Hybrid
✅ Visa Sponsorship

Used Tools & Technologies

IaC, Machine Learning, GPU

Required Skills & Competences

Consul @ 4, Go @ 6, Kubernetes @ 4, Linux @ 4, Python @ 6, GCP @ 4, Distributed Systems @ 4, AWS @ 4, Communication @ 7, Networking @ 3, Rust @ 6, API @ 4, AI @ 4, NCCL @ 3, Slurm @ 4

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The Kubernetes Platform team owns the Kubernetes control plane and the core cluster services that keep Anthropic’s fleets of hundreds of thousands of nodes running across multiple cloud providers and datacenters. The team focuses on extending the scheduler, scaling control plane components, and building core services that must remain fast, correct, and highly available as object and node counts grow by orders of magnitude.

Responsibilities

  • Own, operate, and extend the Kubernetes scheduler for accelerator fleets, including custom scheduling plugins and policies for gang scheduling, topology awareness, and preemption.
  • Scale Kubernetes control plane components (apiserver, etcd, controller-manager) to support clusters far beyond typical limits, and proactively identify bottlenecks.
  • Design, build, and operate the core cluster services, such as service discovery, that every workload in the fleet depends on.
  • Build and maintain custom controllers, operators, and CRDs.
  • Partner with research, training, and inference teams to understand workload shapes and convert requirements into platform capabilities.
  • Collaborate with cloud providers on required features and escalations.
  • Participate in on-call rotations, lead incident response, and design processes (postmortems, runbooks, SLOs) to prevent repeat failures.

Requirements (Minimum qualifications)

  • Significant software engineering experience building and operating production distributed systems.
  • Proficiency in at least one systems-appropriate language (e.g., Go, Python, Rust, or C++).
  • Deep, hands-on Kubernetes experience (beyond user-level) in areas such as scheduler, controllers, apiserver, or operating large multi-tenant clusters.
  • Demonstrated ability to debug complex issues across the stack, from API behavior down to node and network-level root causes.
  • Track record of designing for reliability, correctness, and clear failure semantics in systems other engineers depend on.
  • Strong written and verbal communication; comfortable building consensus with internal stakeholders.

Preferred qualifications

  • Experience with Kubernetes internals or contributions (kube-scheduler / scheduling framework, apiserver, etcd, client-go, controller-runtime, or similar).
  • Experience building or operating cluster schedulers or batch systems (e.g., Kueue, Volcano, Slurm, or in-house equivalents).
  • Background scaling control planes or coordination systems (etcd, ZooKeeper, Consul, or large DNS/service-mesh deployments).
  • Familiarity with ML infrastructure: GPUs, TPUs, or Trainium; gang scheduling; topology-aware placement; collective networking such as NCCL.
  • Experience with GCP and/or AWS, including GKE/EKS internals and Infrastructure as Code.
  • Low-level systems experience such as Linux kernel tuning, cgroups, or eBPF.
  • 8+ years of relevant industry experience, including time leading large, ambiguous infrastructure projects.

Compensation

  • Annual Salary: $320,000 - $405,000 USD

Logistics

  • Minimum education: Bachelor’s degree or equivalent combination of education, training, and/or experience.
  • Location-based hybrid policy: staff are expected to be in one of the offices at least 25% of the time (some roles may require more time in the office).
  • Visa sponsorship: The company states that it sponsors visas and retains an immigration lawyer to help with the process.