Engineering Manager, Inference Routing and Performance

at Anthropic

📍 New York City, United States
📍 San Francisco, United States

USD 405,000-485,000 per year

MIDDLE

✅ Hybrid

✅ Visa Sponsorship

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and you know the pitfalls and tricks.

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Machine Learning @ 3
Hiring @ 3
Communication @ 6
Networking @ 3
Debugging @ 3
API @ 3
Engineering Management @ 5
LLM @ 3
GPU @ 2
AI @ 3

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The Inference Routing team builds the cluster-level routing and coordination plane for Anthropic's inference fleet — the system between the API surface and the inference engines that makes fleet-wide efficiency decisions in real time. The team focuses on routing decisions that account for caching, accelerator suitability, and in-flight work to maximize throughput and meet latency SLOs.

Representative work

Decide whether a proposed routing algorithm change is worth the deploy risk, given its modeled throughput gains and blast radius
Sequence competing priorities (e.g., KV-cache offload, new coordination protocol, model launches)
Debug persistent tail-latency regressions from fleet-level metrics down to kernel/network/framework issues
Build quantitative cases for cross-team protocol changes and present them to peer teams
Run post-incident reviews and turn them into lasting process changes
Interview and evaluate candidates with deep systems and scheduler experience

Responsibilities

Drive system-level performance

Own the technical roadmap for cluster-level inference efficiency: routing decisions, cache placement and eviction, cross-replica coordination, and synchronization protocols
Partner with inference engine, kernels, and performance teams to identify fleet-level throughput and latency wins and turn them into measurable shipped improvements
Build and enforce quantitative performance modeling practices: claim a win only when it is measurable, and know the expected effect of a change before shipping it

Deliver reliably and operate cleanly

Set technical strategy for routing across heterogeneous hardware (GPUs, TPUs, Trainium) and serving surfaces
Run the team's operational backbone: on-call rotation, incident response, postmortems, and deploy safety
Clarify dependencies and commitments between API surface, inference engines, and cloud deployment teams

Build and grow the team

Develop, retain, and hire a strong team that can operate at the OS and framework levels when required
Coach engineers through shifting priorities driven by model launches, hardware changes, and scaling demands
Step in to unblock critical deploys or synthesize design debates when necessary

Requirements

5+ years of engineering management experience, ideally including time leading critical-path production infrastructure at scale
Deep systems background (examples: load balancing, scheduling, cache-coherent distributed state, high-performance networking) sufficient to make architectural calls and evaluate kernel/framework-level work
Experience shipping performance improvements in large-scale systems with measurable impact
Experience running production infrastructure with operational stakes: on-call, incident response, capacity events, and deploy discipline
Results-oriented with a bias toward impact; able to balance throughput, latency, stability, and feature velocity
Strong cross-team communication and collaboration skills
Curious about machine learning systems; willing to learn transformer inference and its systems implications

Strong candidates may also have

Experience with LLM inference serving: KV caching, continuous batching, request scheduling, prefill/decode disaggregation
Background in cluster schedulers, load balancers, service meshes, or coordination planes at scale
Familiarity with heterogeneous accelerator fleets (GPU/TPU/Trainium) and workload placement trade-offs
Experience with GPU/accelerator programming, ML framework internals, or OS-level performance debugging
Experience leading teams at supercomputing or hyperscaler infrastructure scale, or through periods of rapid growth

Compensation

Annual Salary: $405,000 - $485,000 USD

Logistics

Education: At least a Bachelor's degree in a related field or equivalent experience
Location: San Francisco, CA or New York City, NY (location-based hybrid policy: staff expected to be in an office at least 25% of the time)
Visa sponsorship: Anthropic states they sponsor visas and retain an immigration lawyer to assist where possible

Benefits

Competitive compensation and benefits
Optional equity donation matching
Generous vacation and parental leave
Flexible working hours and an office space for collaboration

How we're different

Anthropic emphasizes large-scale collaborative AI research, communication skills, and impact-driven work. They encourage applicants who may not meet every listed qualification to apply and highlight diversity and inclusion in hiring.