Senior Manager, Technical Program Management - DGX Cloud
Required Skills & Competences
- Grafana @ 6
- Prometheus @ 6
- GCP @ 4
- MLOps @ 4
- Leadership @ 4
- Communication @ 4
- Prioritization @ 4
- Reporting @ 4
- System Architecture @ 7
- GPU @ 4
Details
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today, we're at the forefront of AI innovation, powering breakthroughs in research, autonomous vehicles, robotics, and more. The DGX Cloud team builds and operates the AI infrastructure that fuels this progress.
NVIDIA is seeking an experienced and driven Senior Manager of Technical Program Management to lead a high-impact team within our DGX Cloud Infrastructure organization. You will play a critical role in driving sophisticated, cross-functional programs spanning the Compute Platform and cluster bring-ups (including cutting-edge systems such as GB200), and in ensuring world-class fleet availability, occupancy optimization, and infrastructure metrics tracking across the global DGX Cloud fleet.
As a DGX Cloud leader within the Technical Program Management team, you will serve as the vital bridge between NVIDIA Research and DGXC Engineering, driving the development of resilient, high-performance infrastructure for AI training and inference. You'll lead and scale a team that supports mission-critical systems empowering over 1,000 researchers. Your mission is to accelerate NVIDIA's research by delivering a world-class AI environment — from GPU clusters to software stack — setting industry standards in productivity, performance, and global impact.
Responsibilities
- Lead and scale a high-performing team of Technical Program Managers focused on delivering a world-class AI platform that empowers 1,000+ NVIDIA researchers. Prioritize developer productivity, platform usability, and end-to-end user experience.
- Drive programs involving Slurm architecture, configuration, workload management, job prioritization/fair-share policies, alternative schedulers, and hybrid scheduling architectures to enable capacity management and allocation across internal research teams.
- Manage end-to-end cluster bring-ups and integrations with MLOps stacks; define and operate operational models, fleet efficiency metrics, and deployments across hyperscaler environments (OCI, GCP, and others).
- Lead capacity modeling, demand forecasting, and supply-demand balancing; apply prioritization frameworks and collaborate with governance teams to define and implement prioritization strategies.
- Lead initiatives to reduce GPU idle waste, improve cluster utilization metrics, and drive developer-centric programs that accelerate internal developer velocity (see the utilization-tracking sketch after this list).
- Establish and enforce program governance, roadmap planning, and risk management processes. Define key metrics and reporting frameworks to ensure transparency and accountability across engineering and operations.
- Develop and execute communication strategies to keep stakeholders informed at all levels — from engineering contributors to NVIDIA leadership — about program progress, blockers, and impact.
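The posting does not define these fleet metrics precisely, but as a rough illustration of the availability, occupancy, utilization, and idle-waste tracking described above, here is a minimal Python sketch; all class names, field definitions, metric formulas, and numbers are hypothetical assumptions, not NVIDIA's actual methodology.

```python
from dataclasses import dataclass

@dataclass
class ClusterSnapshot:
    """Hypothetical point-in-time view of one cluster in the fleet."""
    name: str
    total_gpus: int        # GPUs physically provisioned
    healthy_gpus: int      # GPUs passing health checks (schedulable)
    allocated_gpus: int    # GPUs currently assigned to jobs
    busy_gpus: int         # allocated GPUs actually running work

def fleet_metrics(snapshots: list[ClusterSnapshot]) -> dict[str, float]:
    """Roll per-cluster snapshots up into fleet-wide fractions."""
    total = sum(s.total_gpus for s in snapshots)
    healthy = sum(s.healthy_gpus for s in snapshots)
    allocated = sum(s.allocated_gpus for s in snapshots)
    busy = sum(s.busy_gpus for s in snapshots)
    return {
        "availability": healthy / total,           # share of fleet in a schedulable state
        "occupancy": allocated / healthy,          # share of healthy GPUs handed to jobs
        "utilization": busy / total,               # share of the whole fleet doing real work
        "idle_waste": (allocated - busy) / total,  # GPUs held by jobs but sitting idle
    }

if __name__ == "__main__":
    fleet = [
        ClusterSnapshot("oci-train-01", total_gpus=4096, healthy_gpus=3968,
                        allocated_gpus=3700, busy_gpus=3400),
        ClusterSnapshot("gcp-infer-01", total_gpus=2048, healthy_gpus=2010,
                        allocated_gpus=1600, busy_gpus=1550),
    ]
    for metric, value in fleet_metrics(fleet).items():
        print(f"{metric}: {value:.1%}")
```

In practice a team in this role would pull these counters from schedulers and health-check pipelines rather than hand-built snapshots; the sketch only shows how the headline numbers relate to one another.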
Requirements
- 15+ years of program management experience leading large-scale software, AI/ML, and infrastructure programs in fast-paced, matrixed environments; 8+ years managing a team.
- Hands-on experience supporting AI/ML platform development, including workload orchestration, platform reliability, researcher tooling, GPU resource management, hardware readiness states, and integration with customer MLOps pipelines.
- Proven track record delivering sophisticated AI/ML infrastructure programs at scale — ideally in cloud, hyperscaler, or enterprise datacenter settings — with deep understanding of system architecture and cluster deployments.
- Strong grasp of capacity modeling, forecasting techniques, and demand/supply reconciliation in compute environments; experience using fleet-wide metrics such as availability, utilization, and occupancy to drive operational excellence and roadmap prioritization.
- Proficiency with monitoring and observability tools such as Grafana, Prometheus, or scheduler-native tooling to track job efficiency, wait times, and node health (a minimal query sketch follows this list).
- MS in Computer Science or a related technical field, or equivalent experience.
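The posting does not prescribe specific dashboards, but as a hedged sketch of how such observability tooling is commonly consumed, the following pulls a couple of illustrative signals from the Prometheus HTTP API. The endpoint and the metric names (a DCGM GPU-utilization gauge and a Slurm pending-queue gauge) are assumptions that depend on which exporters are actually deployed.

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint

# Example PromQL expressions a TPM dashboard might track; metric names are
# assumptions about the deployed exporters (DCGM exporter, Slurm exporter).
QUERIES = {
    "avg_gpu_utilization_pct": "avg(DCGM_FI_DEV_GPU_UTIL)",
    "pending_jobs": "sum(slurm_queue_pending)",
}

def instant_query(promql: str) -> float:
    """Run an instant query against the Prometheus HTTP API and return
    the first sample's value."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                        params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        raise ValueError(f"no data for query: {promql}")
    _timestamp, value = result[0]["value"]
    return float(value)

if __name__ == "__main__":
    for name, promql in QUERIES.items():
        print(f"{name}: {instant_query(promql):.2f}")
```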
Ways to Stand Out / Preferred Qualifications
- Highly motivated with strong communication skills and proven ability to work successfully with cross-functional teams across organizational boundaries and geographies.
- Solid understanding of cloud technologies is a plus.
- Experience with new product introduction and program managing research teams.
- Background with productivity tools and process automation is a big plus.
Benefits & Additional Info
- Base salary range: 232,000 USD - 368,000 USD (determined based on location, experience, and pay of employees in similar positions).
- Eligible for equity and benefits (see NVIDIA benefits page).
- Applications accepted at least until July 29, 2025.
- NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.
- Location: Santa Clara, CA, US. #LI-Hybrid