Senior Production Engineer - DGX Cloud

at Nvidia
📍 World
📍 Canada
📍 United States
USD 168,000-333,500 per year
SENIOR
✅ Remote

Used Tools & Technologies

Not specified

Required Skills & Competences

Go @ 4, Kubernetes @ 7, DevOps @ 4, Python @ 4, Algorithms @ 4, Data Structures @ 4, Distributed Systems @ 4, Hiring @ 4, Communication @ 7, SRE @ 4, GPU @ 4, Observability @ 4, AI @ 4, Slurm @ 7

Details

NVIDIA is hiring experienced Senior Production Engineers to help scale up its AI infrastructure. This role is part of the DGX Cloud team, which is responsible for the production systems that enable large, scalable GPU clusters for AI workloads. The team builds custom software for GPU asset provisioning, configuration, and lifecycle management across cloud providers, with a focus on reliability, observability, and automated operations.

Responsibilities

  • Work on production systems enabling large, scalable GPU clusters for a variety of AI workloads.
  • Develop and maintain custom software related to GPU asset provisioning, configuration, and lifecycle management across cloud providers.
  • Implement monitoring and health management capabilities to achieve high reliability, availability, and scalability of GPU assets.
  • Harness multiple data streams, including GPU hardware diagnostics and cluster/network telemetry, for observability and health management.
  • Collaborate across NVIDIA teams to ensure production AI clusters run reliably and with maximum performance.
  • Evaluate system failures and improve services based on a defined incident management process.
  • Contribute to the codebase; Production Engineering is treated as a software engineering discipline.

Requirements

  • Direct experience in a Production Engineering / DevOps / SRE role within a highly technical organization with demonstrable impact.
  • 8+ years in a similar role working on large-scale production systems.
  • Strong knowledge of site reliability principles and techniques, including reliability assessments, incident management, production system observability, monitoring and alerting, automated deployments, and toil elimination.
  • Technical knowledge of systems programming languages such as Go and Python, and a solid understanding of data structures and algorithms.
  • Highly motivated with strong communication skills and the ability to work successfully with cross-functional teams, principals, and architects across geographies.

Ways to stand out

  • Technical competency in managing and automating large-scale distributed systems independent of cloud providers.
  • Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Bright Cluster Manager).
  • Proven operational excellence in maintaining reliable and performant AI infrastructure.

Compensation & Benefits

  • Base salary ranges: 168,000 USD - 270,250 USD for Level 4; 208,000 USD - 333,500 USD for Level 5.
  • Eligible for equity and benefits (link to NVIDIA benefits referenced in original posting).

Additional information

  • Applications accepted at least until May 22, 2026.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer committed to diversity and inclusion.