Senior Software Engineer, DGX Cloud Production Engineering

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ On-site

Required Skills & Competences

Security @ 4, Go @ 7, Kubernetes @ 4, Linux @ 4, Terraform @ 4, Python @ 7, ArgoCD @ 4, Distributed Systems @ 4, Hiring @ 4, Communication @ 4, Networking @ 4, Debugging @ 4, API @ 4, GPU @ 4, Observability @ 4, AI @ 4

Details

NVIDIA DGX Cloud is building and operating large-scale GPU infrastructure for AI research and production workloads. This role is part of a production engineering team focused on Kubernetes-based infrastructure, GPU cluster operations, reliability, automation, GitOps, and Day 2 operability across DGX Cloud environments.

Responsibilities

  • Build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments.
  • Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations.
  • Improve Day 0 / Day 1 / Day 2 workflows for cluster bring-up, handoff, and production operations.
  • Reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows.
  • Participate in on-call, incident response, debugging, and durable follow-up work.
  • Partner with platform, storage, networking, security, and workload teams to make infrastructure production-ready.

Requirements

  • 8+ years of experience building or operating production infrastructure.
  • Strong programming skills in Python, Go, or similar languages.
  • Experience with Linux, Kubernetes, containers, cloud infrastructure, or infrastructure automation.
  • Ability to troubleshoot distributed systems in production.
  • Clear communication and ability to work across teams.
  • BS/MS in Computer Science or equivalent experience.

Ways to stand out

  • Experience with GPU infrastructure, Kubernetes operators, GitOps, Terraform, ArgoCD, or fleet automation.
  • Experience with SLOs, on-call, incident response, observability, and reliability practices.
  • Exposure to BMaaS, VMaaS, managed Kubernetes, or multi-cloud infrastructure.

Compensation

  • Base salary range for Level 4: 184,000 USD - 287,500 USD per year.
  • Base salary range for Level 5: 224,000 USD - 356,500 USD per year.
  • You will also be eligible for equity and benefits.

Additional information

  • Applications for this job will be accepted at least until May 22, 2026.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer and values diversity in hiring and promotion practices.