Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level value.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 3
Kubernetes @ 3
Terraform @ 3
Communication @ 3
Jira @ 3
API @ 3
Compliance @ 3
GPU @ 3
Deep Learning @ 3
Observability @ 3
AI @ 3
Details
NVIDIA's deep learning platforms are widely used by academic institutions, startups, and major internet companies. The DGX Cloud team is looking for a Technical Program Manager (TPM) with extensive experience in cloud infrastructure bring-up with external partners to collaborate with Cloud Service Providers (CSPs), NVIDIA Cloud Providers (NCPs), and internal engineering teams to build AI capacity and infrastructure globally.
Responsibilities
- Partner with Engineering, Infrastructure, and Software teams to drive programs related to AI capacity enablement and management.
- Define and communicate requirements to CSPs and NCPs, coordinating with storage and network engineering teams to drive alignment and capacity planning (POR) based on workload needs.
- Drive early engagement with CSPs and NCPs to understand managed storage and network solutions and influence roadmap alignment.
- Gather technical requirements, develop roadmaps, establish milestones, and ensure adherence to the Product Lifecycle (PLC) process.
- Manage ongoing capacity operations and engineering engagement with partners, focusing on availability, maintenance, and other performance indicators.
- Work with internal teams to understand workload requirements and related hardware/infrastructure needs (speeds and feeds) to optimize readiness with cloud vendors and providers.
- Use Jira and other program management platforms to bring rigor and structure to engineering deliverables.
- Identify and drive adoption of third-party and in-house cloud infrastructure solutions for deployment, support, security, compliance, and observability across DGX Cloud.
- Establish KPIs and quantitatively demonstrate program value and impact.
- Proactively identify, resolve, and mitigate risks and issues affecting scope, schedule, and quality.
- Promote continuous improvement and process enhancements within cloud infrastructure operations.
Requirements
- 10+ years of technical program management experience, including planning and executing large-scale cloud infrastructure programs involving external organizations; strong focus on software engineering projects in matrixed organizations.
- Extensive hands-on experience in cloud infrastructure, preferably from a major Cloud Service Provider (CSP).
- Domain knowledge in bring-up and end-to-end operations of compute, storage, and GPU, including common failure points at hardware and software levels.
- Expert-level proficiency with Jira, Smartsheet, or similar program management tools and ability to guide engineering teams on their use.
- Strong strategic and tactical thinking, consensus building, and program execution skills.
- Comfortable working in ambiguous, fast-paced environments and able to grow into evolving responsibilities.
- Excellent communication and technical presentation skills, especially for executive audiences.
- BS or MS in Electrical Engineering or Computer Science, or equivalent experience.
Ways to Stand Out
- In-depth knowledge of NVIDIA GPU products, including deployment and bring-up.
- Working knowledge of cloud technologies such as Kubernetes, API integration, and Terraform.
- Experience with productivity tools and process automation.
- Deep familiarity with cloud-native services/environments and AI/ML infrastructure.
Compensation & Benefits
- Base salary ranges provided by location and level:
- Level 4: 168,000 USD - 258,750 USD
- Level 5: 200,000 USD - 322,000 USD
- Eligible for equity and benefits (see NVIDIA benefits links referenced in posting).
Additional Information
- Location: Santa Clara, CA, United States (hybrid).
- Applications accepted at least until June 12, 2026.
- NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.