Technical Program Manager, Cloud Infrastructure

at NVIDIA

USD 168,000-322,000 per year

MIDDLE SENIOR

✅ Hybrid

Tech Stack
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;

7-9: expert. You can teach others and you know all the pitfalls and tricks;

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

AI @ 2, API @ 3, Communication @ 3, Compliance, GPU @ 3, Jira @ 5, Kubernetes @ 3, Machine Learning, Observability, Security, Terraform @ 3

Details

Overview

NVIDIA is seeking an accomplished Technical Program Manager (TPM) to join its NVIDIA DGX Cloud team. The role focuses on delivering value to DGX Cloud customers by collaborating with CSPs (Cloud Service Providers), emerging cloud providers, and internal engineering teams to build AI capacity and infrastructure across the globe.

Responsibilities

Work closely with storage engineering and network engineering teams to define and communicate requirements to CSPs and NCPs (NVIDIA Cloud Providers). Drive alignment on a plan of record (POR) for capacity blocks based on workload needs.
Drive early engagement with CSPs and NCPs to understand their managed storage and network solutions, and influence alignment with the NVIDIA Cloud roadmap.
Gather technical requirements, develop comprehensive roadmaps, establish clear milestones, and ensure adherence to the Product Lifecycle (PLC) process.
Manage ongoing capacity operations and engineering engagement with CSP and NCP partners, collaborating with engineering leads. Focus on availability, maintenance, and other critical performance indicators.
Partner with teams within NVIDIA to understand workload requirements and related hardware and infrastructure needs, including speeds and feeds, to optimize infrastructure readiness with CSPs and NCPs.
Use Jira and other program management platforms to instill rigor and structure in managing engineering deliverables.
Identify and drive opportunities to adopt third-party and in-house cloud infrastructure solutions for deployments, support, security, compliance, and observability across DGX Cloud.
Establish KPIs and quantitatively demonstrate value and impact delivered by programs.
Proactively identify, mitigate, and resolve risks and issues affecting scope, schedule, and quality across all aspects of each program.
Encourage a culture of continuous improvement by identifying process improvement opportunities within cloud infrastructure operations.

Requirements

10+ years of technical program management experience, including driving planning and execution of large-scale cloud infrastructure programs with outside organizations.
Extensive hands-on experience in cloud infrastructure, preferably gained from working at a major Cloud Service Provider (CSP).
Domain knowledge in bring-up and end-to-end operations of compute, storage, and GPU systems, including common failure points at the hardware and software levels.
Expert-level proficiency with Jira, Smartsheet, or similar program management tools, and ability to guide engineering teams on tool usage.
Outstanding strategic and tactical thinking; ability to build consensus and drive program success.
Comfort working within ambiguous environments.
Excellent communication and technical presentation skills, particularly for executive audiences.
BS or MS in Electrical Engineering or Computer Science, or equivalent experience.

Ways to Stand Out

In-depth knowledge of NVIDIA GPU products, including deployment and bring-up.
Working knowledge of cloud technologies (Kubernetes, API integration, Terraform, etc.).
Enthusiastic, energetic, responsive individual focused on identifying process improvement opportunities.
Significant experience with productivity tools and process automation (major plus).
Deep familiarity with cloud-native product and service environments, plus familiarity with AI/ML infrastructure and cloud services.