Product Manager, Compute Platform

USD 305,000-385,000 per year
MIDDLE
✅ Hybrid
✅ Visa Sponsorship

Required Skills & Competences

  • Kubernetes @ 3
  • Algorithms @ 3
  • Distributed Systems @ 6
  • Leadership @ 3
  • Communication @ 3
  • Prioritization @ 3
  • Product Management @ 3
  • GPU @ 3
  • Observability @ 3
  • AI @ 3
  • Slurm @ 3

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the role

As a Product Manager focused on Compute Platform, you’ll partner with Infrastructure, Compute Operations, Engineering, Finance & Strategy, and Research to build the scheduling, orchestration, and capacity management systems that power Anthropic’s compute infrastructure—the foundation on which every model training run, evaluation, and inference workload depends.

Key responsibilities described in the posting include:

  • Partner with Infrastructure to build the systems that determine how jobs are scheduled, prioritized, and allocated across Anthropic’s fleet of GPU and accelerator clusters.
  • Improve cluster utilization, cost efficiency, and researcher velocity by defining semantic layers for job scheduling, establishing resource guarantees, and making trade-offs to keep infrastructure running at peak capacity.
  • Drive the evolution of the compute platform to support diverse workloads (large-scale training, fine-tuning, real-time inference, batch evaluation) with distinct scheduling requirements and priority levels.
  • Define and own the roadmap across job scheduling primitives, capacity allocation policies, preemption and fairness frameworks, quota management, and observability tooling.

Responsibilities

  • Deeply understand internal customers across Research, Infrastructure, Product, and Finance (e.g., researchers needing guaranteed resources for multi-week training runs, platform teams with strict latency SLAs).
  • Define and iterate on the semantic layer for job scheduling: abstractions, priority tiers, resource classes, and preemption policies.
  • Partner with engineering leads to design scheduling capabilities that maximize cluster utilization while honoring resource guarantees; ensure job prerequisites (data, checkpoints, hardware affinity) are validated before launch.
  • Drive product strategy and roadmap for compute capacity management, including quota systems, fairness policies, bin-packing optimizations, and gang-scheduling for distributed workloads.
  • Own the trade-off framework between utilization efficiency, job latency, cost, and reliability; communicate prioritization decisions to senior leadership.
  • Collaborate with Capacity Strategy & Operations on capacity planning models, demand forecasting, and cost-to-serve analytics.
  • Build and champion observability tools and dashboards that provide real-time visibility into cluster health, queue depth, scheduling efficiency, and resource waste.

Requirements / Qualifications

  • 7+ years of product management experience, with deep exposure to compute infrastructure, distributed systems, or scheduling/orchestration platforms.
  • Experience taking technical infrastructure products from infancy to scale—building from the ground up and growing to serve demanding internal or external customers.
  • Track record of building platform products that balance the needs of multiple users and stakeholders, and of making prioritization trade-offs among utilization, latency, cost, and fairness.
  • Ability to internalize complex technical systems (job schedulers, cluster managers, resource orchestrators) and translate that understanding into a product vision.
  • Credibility across functions: able to discuss scheduling algorithms with engineers, capacity economics with finance, and infrastructure strategy with leadership.
  • Strong instinct for connecting technical decisions to business outcomes; scrappy and resourceful in a fast-moving environment.

Strong candidates may have

  • Built or scaled job scheduling, resource orchestration, or workload management systems for large-scale compute clusters (e.g., Kubernetes, Slurm, Borg, YARN, or custom schedulers).
  • Deep familiarity with GPU/accelerator scheduling challenges: gang-scheduling, topology-aware placement, preemption, and hardware affinity constraints.
  • Experience defining/enforcing SLAs and resource guarantees for compute workloads and validating job prerequisites (data readiness, checkpoints, hardware compatibility) before scheduling.
  • Capacity planning experience across cloud and on-premises infrastructure, including cost modeling, demand forecasting, and vendor management for compute procurement.
  • Experience with observability and efficiency tooling for distributed infrastructure—building dashboards, automation, and governance workflows that drive utilization and cost accountability.

Compensation

Annual Salary: $305,000 - $385,000 USD

Logistics

  • Minimum education: Bachelor’s degree or equivalent combination of education, training, and/or experience.
  • Required field of study: a field relevant to the role as demonstrated through coursework, training, or professional experience.
  • Minimum years of experience: will correlate with internal job level requirements for the position.
  • Location-based hybrid policy: staff are currently expected to be in one of Anthropic’s offices at least 25% of the time (some roles may require more time in office).
  • Visa sponsorship: the posting states, “We do sponsor visas!” The company retains an immigration lawyer and will make reasonable efforts to obtain a visa when an offer is made.

How we’re different

Anthropic views AI research as large-scale, collaborative, and impact-focused. The team works as a single cohesive group on a few large-scale research efforts and values communication and cross-functional collaboration.

Benefits

The posting states Anthropic offers competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space for collaboration. The posting also includes candidate AI usage guidance and applicant safety guidance regarding recruiter communications.