Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Kubernetes @ 2
GCP @ 3
Leadership @ 3
AWS @ 3
Azure @ 3
Communication @ 3
GPU @ 3
Observability @ 3
AI @ 3
Slurm @ 2
HPC @ 3
Details
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The Compute team runs the compute infrastructure that supports model training, evaluation, and inference workloads.
About the role
As a Technical Program Manager on the Compute team, you will drive planning, coordination, and execution of programs that keep Anthropic's compute infrastructure running efficiently at scale. You will take ownership of critical workstreams across the compute lifecycle, from procurement and bringing capacity online to allocation and utilization across teams. The exact focus will depend on your strengths and the team's needs. You will partner with Infrastructure, Systems, Research, Finance, and Capacity Engineering to shape processes, tooling, and coordination mechanisms.
Responsibilities
- Own and drive critical programs across the compute lifecycle, coordinating execution across multiple engineering, research, and operations teams
- Build and maintain operational visibility into the compute fleet: supply, demand, utilization, and health
- Lead cross-functional coordination for compute transitions: bringing new capacity online, migrating workloads, and managing decommissions across cloud providers and hardware platforms
- Partner with engineering and research leadership to prioritize and align on compute planning, allocation, and usage
- Identify and close operational gaps via tooling, improved processes, or better cross-team communication
- Own trade-off discussions between utilization, cost, latency, and reliability; synthesize inputs from technical and business stakeholders and communicate decisions to leadership
- Develop and improve processes and frameworks for planning, tracking, and executing compute programs at scale
Requirements / Qualifications
- 7+ years of technical program management experience in infrastructure, platform engineering, or compute-intensive environments
- Experience leading complex, cross-functional programs involving multiple engineering teams with competing priorities and ambiguous requirements
- Experience working with research or ML teams and translating their needs into operational plans and technical requirements
- Comfort diving into technical details (cloud infrastructure, cluster management, job scheduling, resource orchestration) while maintaining program-level visibility
- Ability to define scope and build processes in ambiguous, fast-moving environments
- Strong communication skills and credibility with engineers, researchers, finance, and executive leadership
- Track record of building trust with engineering teams and driving changes through influence rather than authority
Strong candidates may also have
- Experience managing compute capacity across multiple cloud providers (AWS, GCP, Azure) or hybrid cloud/on-premises environments
- Familiarity with job scheduling, resource orchestration, or workload management systems (Kubernetes, Slurm, Borg, YARN, or custom schedulers)
- Experience with GPU or accelerator infrastructure and large-scale ML training/inference workloads
- Experience building or improving observability for infrastructure systems: dashboards, alerting, efficiency metrics, or cost attribution
- Capacity planning experience including demand forecasting, cost modeling, or hardware lifecycle management
- Experience scaling through hypergrowth in AI/ML, HPC, or large-scale cloud environments
Logistics
- Locations: San Francisco, CA; New York City, NY; Seattle, WA
- Education: Bachelor’s degree in a related field or equivalent experience required
- Location-based hybrid policy: staff are expected to be in one of Anthropic's offices at least 25% of the time (some roles may require more time in office)
- Visa sponsorship: Anthropic states that it sponsors visas and retains an immigration lawyer to assist where possible
- Deadline to apply: applications are received on a rolling basis
Compensation & Benefits
- Annual salary range: $290,000 - $365,000 USD
- Anthropic offers competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office workspace
How we're different
Anthropic emphasizes large-scale, collaborative AI research, valuing impact and communication. The organization hosts frequent research discussions and focuses on high-impact research directions (examples listed in the posting include work related to GPT-3, interpretability, scaling laws, and learning from human preferences).