Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — expert. Able to teach others, with knowledge of all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 4
Kubernetes @ 6
Leadership @ 7
Technical Proficiency @ 6
Communication @ 4
IaaS @ 6
Technical Leadership @ 7
GPU @ 4
Details
NVIDIA is leading the industry in delivering accelerated computing in cloud and enterprise environments. The team is focused on solving large-scale technical challenges that impact customers worldwide. This role will lead the technical strategy and implementation for the lifecycle, health, observability, utilization monitoring, and automated remediation of the DGX Cloud GPU fleet across multiple environments (bare metal, cloud service providers, and neoclouds).
Responsibilities
- Define and drive the architecture and technical implementation of the DGX Cloud operations practice for GPU fleet lifecycle management.
- Lead the full software and system lifecycle across a large technical scope: ideation, architecture, design, development, deployment, operations, and continuous lifecycle management.
- Define and develop auto-remediation strategies that detect faults in critical systems, fix them, validate the fix, and restore the systems to service.
- Collaborate cross-functionally to embed technical strategy and best practices in DGX Cloud engineering.
- Guide technical delivery for DGX Cloud across its delivery environments: enterprise, public cloud, and high-security/isolated/sovereign environments.
- Engage stakeholders, including customers, infrastructure providers, and partners, to ensure operational excellence and availability.
Requirements
- 15-18+ years overall in technical roles with a focus on operations and automation for cloud infrastructure, platforms, and applications.
- 5-10+ years of lead experience.
- BS/MS or higher, or equivalent experience in systems/software engineering or related engineering fields.
- Technical proficiency in multi-tenant data center and cloud-native architectures, including bare metal, virtualization, containerization, and higher-level abstractions (IaaS, Kubernetes, Slurm), as well as AI/ML platforms and applications.
- Demonstrated success delivering technically complex solutions that provide transparency into resource utilization, performance, and operational insights.
- Strong technical leadership: the ability to synthesize multi-functional needs into architecture and design, and to guide execution across teams.
- Excellent communication and partnership skills for engaging peers, partners, and customers.
Ways to Stand Out
- Real-world experience applying AI to component- and system-level issue identification and remediation.
- Direct experience designing, developing, delivering, and operating highly available, scaled systems in enterprise and cloud environments.
- History of creating scalable processes and extensible systems to facilitate operations at scale.
- Familiarity with open source ecosystems and the ability to collaborate on and influence open source project governance.
Compensation & Other Information
- Base salary range: 308,000 USD - 471,500 USD (determined based on location, experience, and pay of employees in similar positions).
- Eligible for equity and benefits.
- Applications accepted at least until January 6, 2026.
- NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.