Senior Technical Program Manager, DGX Cloud Software Products and Services

at NVIDIA
USD 200,000-322,000 per year
SENIOR
✅ Hybrid

Required Skills & Competences

Distributed Systems @ 4, Machine Learning @ 4, Communication @ 7, Git @ 6, SRE @ 4, Jira @ 6, Project Management @ 6, GPU @ 4, Deep Learning @ 4, Observability @ 4, AI @ 4, NCCL @ 4, Slurm @ 4

Details

NVIDIA's DGX Cloud (DGXC) powers AI for strategic research and product workloads. This role is for an IC5 Technical Program Manager (TPM) who will lead cross-functional, strategic programs focused on resilience, reliability, operational scale, and goodput for DGX Cloud. The TPM will partner with engineering, SRE, operations, and researchers to deliver fault-tolerant, high-availability training and inference environments at scale, and to guide the resilience reference architecture and its modular software components.

Responsibilities

  • Lead cross-functional programs that improve resilience, reliability, operational scale, and fleet-wide goodput across DGX Cloud.
  • Partner across infrastructure, platform, site reliability, operational, and tenant teams to identify systemic risks, resolve cross-stack dependencies, and improve end-to-end service stability.
  • Drive definition and adoption of resilience reference stacks, operational standards, and scalable guidelines to strengthen service readiness and recovery.
  • Partner with engineering teams and researchers to support development and delivery of open, modular software components for resilience, facilitating reusable and extensible capabilities across the platform.
  • Build and scale resilience tooling and operational mechanisms that improve observability, failure detection and attribution, root cause analysis, recovery orchestration, and operational readiness.
  • Define, measure, and improve goodput using data-driven insights to increase usable fleet capacity, workload efficiency, and customer outcomes at scale.
  • Establish clear metrics, dashboards, and operating cadences to track program health, reliability posture, operational maturity, and performance.
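
The posting does not define goodput. As a rough illustration only, assuming it means the share of allocated GPU time that produces retained training progress, the metric could be sketched as below. The JobWindow schema and its field names are hypothetical and do not come from NVIDIA.

```python
from dataclasses import dataclass


@dataclass
class JobWindow:
    """GPU-hours for one job over a reporting window (hypothetical schema)."""
    allocated_gpu_hours: float   # GPU-hours reserved for the job
    productive_gpu_hours: float  # GPU-hours spent on work that was kept


def goodput(jobs: list[JobWindow]) -> float:
    """Assumed definition: productive GPU-hours divided by allocated GPU-hours."""
    allocated = sum(j.allocated_gpu_hours for j in jobs)
    if allocated == 0:
        return 0.0  # no capacity allocated in this window
    productive = sum(j.productive_gpu_hours for j in jobs)
    return productive / allocated


# Illustrative numbers only, not real fleet data.
jobs = [
    JobWindow(allocated_gpu_hours=1000.0, productive_gpu_hours=900.0),
    JobWindow(allocated_gpu_hours=500.0, productive_gpu_hours=300.0),
]
fleet_goodput = goodput(jobs)  # 1200 / 1500
```

Under this assumed definition, time lost to failures, restarts, and recomputation after a checkpoint rollback lowers the numerator, which is why the resilience and recovery work listed above feeds directly into the metric.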

Requirements

  • MS in EE or CS, or equivalent experience.
  • 8+ years of experience in program management of large-scale software or infrastructure projects.
  • Proven track record of leading complex cross-functional programs in cloud, infrastructure, distributed systems, or platform environments.
  • Strong analytical skills to assess issues across infrastructure, software, and operational layers.
  • Excellent organizational skills and experience with project management tools (e.g., Jira, Aha!, Confluence) and distributed version control systems (e.g., Git).
  • Solid understanding of reliability engineering, resilience development, and service performance metrics including goodput, efficiency, and utilization.
  • Experience working alongside engineering, SRE, operations, and technical collaborators in ambiguous, high-complexity environments.
  • Outstanding communication and presentation skills for diverse technical and non-technical audiences, with strong problem-solving and conflict management abilities.

Ways To Stand Out

  • Background in computer science, machine learning, deep learning, open-source software, GPU technology, AI infrastructure, or large-scale compute platforms.
  • Experience with large-scale AI training environments (e.g., distributed training frameworks, checkpointing, NCCL, Slurm or other schedulers).
  • Prior experience managing customer workflows on large-scale distributed computing and working with AI researchers or directly training and evaluating AI models.
  • Proven ability to harness AI-enabled workflows and tools to improve program management efficiency, decision-making, execution visibility, and operational efficiency.

Compensation & Other Details

  • Base salary ranges are provided by location and level. The base salary range for Level 5 (IC5) is USD 200,000-322,000. (A Level 4 range is also listed: USD 168,000-258,750.)
  • Eligible for equity and benefits. See NVIDIA benefits links referenced in the posting.
  • Applications are accepted at least until May 8, 2026.
  • Work style indicator: #LI-Hybrid (hybrid workplace)