Engineering Manager, DGX Cloud Production Engineering

at Nvidia
šŸ“ World
šŸ“ Canada
šŸ“ United States
USD 224,000-356,500 per year
MIDDLE
✅ Remote

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 3, Kubernetes @ 3, Distributed Systems @ 3, Communication @ 6, Networking @ 3, SRE @ 3, Prioritization @ 6, GPU @ 3, Observability @ 3, AI @ 3

Details

NVIDIA DGX Cloud is building the operating model for reliable, scalable GPU infrastructure across internal, partner, and on-prem environments. This role leads a team of software and production engineers focused on Kubernetes-based operations, automation, reliability, and cluster lifecycle tooling. The manager will run current production systems while building automation and engineering practices for the next generation of DGX Cloud infrastructure.

Responsibilities

  • Lead a team of software and production engineers building and operating DGX Cloud infrastructure across NVIDIA Cloud Partner (NCP) and on-prem environments.
  • Drive execution across cluster operations, Kubernetes operability, automation, GitOps, observability, and incident response.
  • Help define team priorities, roadmap, staffing, and operational ownership.
  • Partner with platform, workload, storage, networking, security, and TPM teams to improve production readiness.
  • Build a healthy on-call and incident review culture focused on learning, ownership, and durable fixes.
  • Coach engineers, grow technical leaders, and create clear ownership across ambiguous problem spaces.

Requirements

  • 8+ years of overall industry experience, including 2+ years leading or managing engineers.
  • Experience building or operating production infrastructure, cloud platforms, Kubernetes environments, or distributed systems.
  • Strong understanding of reliability engineering, automation, observability, incident response, and operational excellence.
  • Ability to work across teams and influence without direct authority.
  • Clear communication, strong prioritization, and sound judgment in fast-moving environments.
  • BS/MS in Computer Science or equivalent experience.

Ways to stand out

  • Experience leading SRE, production engineering, infrastructure automation, or platform teams.
  • Experience with GPU infrastructure, Kubernetes fleet operations, GitOps, BMaaS/VMaaS, managed Kubernetes, or multi-cloud environments.
  • Track record of reducing toil, improving SLOs, and turning operational work into software-driven systems.

Compensation & Benefits

  • Base salary range: 224,000 USD - 356,500 USD per year (determined based on location, experience, and pay of employees in similar positions).
  • Eligible for equity and benefits (see NVIDIA benefits information referenced in the posting).

Other information

  • Applications accepted at least until May 31, 2026. This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer and provides a diverse work environment.