Senior Production Engineer - DGX Cloud

at NVIDIA

📍 World
📍 Canada
📍 United States

USD 168,000-333,500 per year

SENIOR

✅ Remote

Used Tools & Technologies

Not specified

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels are:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and know all the pitfalls and tricks.

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Go @ 4, Kubernetes @ 7, DevOps @ 4, Python @ 4, Algorithms @ 4, Data Structures @ 4, Distributed Systems @ 4, Hiring @ 4, Communication @ 7, SRE @ 4, GPU @ 4, Observability @ 4, AI @ 4, Slurm @ 7

Details

NVIDIA is hiring experienced Senior Production Engineers to help scale up its AI infrastructure. This role is part of the DGX Cloud team, which is responsible for the production systems that enable large, scalable GPU clusters for AI workloads. The team builds custom software for GPU asset provisioning, configuration, and lifecycle management across cloud providers, with a focus on reliability, observability, and automated operations.

Responsibilities

Work on production systems that enable large, scalable GPU clusters for a variety of AI workloads.
Develop and maintain custom software related to GPU asset provisioning, configuration, and lifecycle management across cloud providers.
Implement monitoring and health management capabilities to achieve high reliability, availability, and scalability of GPU assets.
Harness multiple data streams, including GPU hardware diagnostics and cluster/network telemetry, for observability and health management.
Collaborate across NVIDIA teams to ensure production AI clusters run reliably and with maximum performance.
Evaluate system failures and improve services based on a defined incident management process.
Contribute to the codebase: Production Engineering is treated as a software engineering discipline.

Requirements

Direct experience in a Production Engineering / DevOps / SRE role within a highly technical organization with demonstrable impact.
8+ years in a similar role working on large-scale production systems.
Strong knowledge of site reliability principles and techniques, including reliability assessments, incident management, production system observability, monitoring and alerting, automated deployments, and toil elimination.
Technical knowledge of systems programming languages (for example, Go or Python) and a solid understanding of data structures and algorithms.
Highly motivated with strong communication skills and the ability to work successfully with cross-functional teams, principals, and architects across geographies.

Ways to stand out

Technical competency managing and automating large-scale distributed systems independent of cloud providers.
Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Bright Cluster Manager).
Proven operational excellence in maintaining reliable and performant AI infrastructure.

Compensation & Benefits

Base salary ranges: 168,000 USD - 270,250 USD for Level 4; 208,000 USD - 333,500 USD for Level 5.
Eligible for equity and benefits (link to NVIDIA benefits referenced in original posting).

Additional information

Applications accepted at least until May 22, 2026.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer committed to diversity and inclusion.