Distinguished Engineer, GPU Fleet Operations Automation

at NVIDIA
USD 308,000-471,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 4, Kubernetes @ 6, Leadership @ 7, Technical Proficiency @ 6, Communication @ 4, IaaS @ 6, Technical Leadership @ 7, GPU @ 4

Details

NVIDIA is leading the industry in delivering accelerated computing in cloud and enterprise environments. The team is focused on solving large-scale technical challenges that impact customers worldwide. This role will lead the technical strategy and implementation for DGX Cloud GPU fleet lifecycle, health, observability, utilization monitoring, and automated remediation across multiple environments (bare metal, cloud service providers, and neoclouds).

Responsibilities

  • Define and drive the technical implementation and architecture of the DGX Cloud operations practice for the GPU fleet lifecycle.
  • Lead full software and system lifecycle activities across a large technical scope: ideation, architecture, design, development, deployment, operations, and continuous lifecycle management.
  • Define and develop auto-remediation strategies that detect and fix faults in critical systems, validate the fix, and restore those systems to service.
  • Collaborate cross-functionally to drive technical strategy and best practices into DGX Cloud engineering practices.
  • Guide technical delivery into DGX Cloud across delivery environments: enterprise, public cloud, and high-security/isolated/sovereign environments.
  • Engage stakeholders including customers, infrastructure providers, and partners to ensure operational excellence and availability.

Requirements

  • 15-18+ years overall in technical roles with a focus on operations and automation for cloud infrastructure, platforms, and applications.
  • 5-10+ years of lead experience.
  • BS/MS or higher, or equivalent experience in systems/software engineering or related engineering fields.
  • Technical proficiency in multi-tenant data center and cloud-native architectures, including bare metal, virtualization, containerization, and higher-level abstractions (IaaS, Kubernetes, Slurm), as well as AI/ML platforms and applications.
  • Demonstrated success delivering technically complex solutions that provide transparency into resource utilization, performance, and operational insights.
  • Strong technical leadership: ability to synthesize multi-functional needs into architecture and design and guide execution across teams.
  • Excellent communication and partnership skills for engaging peers, partners, and customers.

Ways to Stand Out

  • Real-world experience applying AI to component- and system-level issue identification and remediation.
  • Direct experience designing, developing, delivering, and operating highly available, scaled systems in enterprise and cloud environments.
  • History of creating scalable processes and extensible systems to facilitate operations at scale.
  • Familiarity with open source ecosystems and the ability to collaborate and influence open source project governance.

Compensation & Other Information

  • Base salary range: 308,000 USD - 471,500 USD (determined based on location, experience, and pay of employees in similar positions).
  • Eligible for equity and benefits.
  • Applications accepted at least until January 6, 2026.
  • NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.