Staff+ Software Engineer, Inference Runtime

USD 405,000-485,000 per year
SENIOR
✅ Hybrid
✅ Visa Sponsorship

Used Tools & Technologies

Machine Learning

Required Skills & Competences

Kubernetes @ 3, Python @ 4, CI/CD @ 4, Distributed Systems @ 4, AWS @ 4, Communication @ 7, Prioritization @ 4, Rust @ 4, Debugging @ 4, CUDA @ 4, GPU @ 4, AI @ 4, Profiling @ 4

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The Inference organization serves Claude to millions of users and enterprise customers with the speed, reliability, and efficiency that frontier AI demands. This role is a technical lead for Inference Runtime: the team that owns the shared, accelerator-agnostic core of Anthropic's inference serving stack.

About the role

This is a senior individual-contributor role with broad technical ownership. You will set technical direction for the runtime's architecture, its release and validation systems, and the workflows engineers use to develop on top of it. You will partner across Inference and with central Infrastructure to make decisions about boundaries, prioritization, and tradeoffs across heterogeneous accelerator platforms (GPU, TPU, Trainium). You will work hands-on in a performance-sensitive Rust and Python codebase and mentor engineers on the team.

Responsibilities

  • Set technical direction for the team, owning the architecture and roadmap for the shared runtime of the inference serving stack
  • Own and evolve the accelerator-agnostic runtime: interfaces, internal boundaries, and build structure, including hands-on work in a performance-sensitive Rust and Python codebase
  • Ensure platform expansion costs remain low by isolating specialization and keeping edge cases out of the core
  • Drive efficient accelerator usage: utilization, scheduling, and memory management across GPU, TPU, and Trainium
  • Build the runtime's validation surface around partitioned builds, change-scoped testing, and canary/shadow/rollback mechanisms
  • Act as a technical counterpart to central Infrastructure on compilers, build systems, and toolchains; decide when to build vs. adopt
  • Mentor engineers through design and code review and direct collaboration

Minimum qualifications

  • Deep background in systems engineering or ML infrastructure, including performance profiling, latency and throughput optimization, and systems debugging at scale
  • Real depth in at least one accelerator ecosystem (CUDA/GPU, TPU, or Trainium/AWS Neuron) and a desire to keep the runtime agnostic across them
  • Significant software engineering experience with high-performance, large-scale distributed systems serving millions of users
  • Experience defining and using engineering metrics (SLOs) to drive measurable improvements in escape rates, release times, latency, or throughput
  • Experience driving technical alignment across organizational boundaries and strong written and verbal communication skills

Preferred qualifications

  • 8+ years of software engineering experience, with significant time as a technical lead or anchor on a platform, inference runtime, or ML infrastructure team
  • Experience with ML compiler toolchains (XLA, Triton, NeuronX) or accelerator driver/firmware management at scale
  • Background operating production as a validation surface at scale: shadow traffic, canary populations, automated baseline comparison, fast rollback
  • Experience with deterministic or simulation-based testing for hardware-dependent systems
  • Experience with CI/CD systems at scale for accelerator workloads
  • Familiarity with Kubernetes-based development and job scheduling environments

Compensation

Annual Salary: $405,000 - $485,000 USD

Logistics

  • Locations: Remote-friendly; San Francisco, CA; Seattle, WA; New York City, NY
  • Minimum education: Bachelor's degree or equivalent experience
  • Location-based hybrid policy: staff are expected to be in one of the offices at least 25% of the time
  • Visa sponsorship: Anthropic states that it sponsors visas and will make reasonable efforts to assist with immigration if an offer is made

Technologies and topics mentioned

Rust, Python, GPUs, CUDA, TPU, Trainium, AWS Neuron, ML compiler toolchains (XLA, Triton, NeuronX), performance profiling, latency and throughput optimization, distributed systems, SLOs/metrics, CI/CD, Kubernetes, build systems, canary/shadow/rollback, validation/testing strategies