Principal Software Engineer, Distributed Systems Engineer - DGX Cloud

at Nvidia
USD 248,000-396,750 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Software Development @ 4, Go @ 7, Kubernetes @ 4, Python @ 7, Algorithms @ 7, Data Structures @ 7, Distributed Systems @ 4, Hiring @ 4, Communication @ 7, Mathematics @ 4, API @ 4, GPU @ 4, AI @ 4, Slurm @ 7

Details

NVIDIA is hiring experienced software engineers with a Kubernetes background to help scale its AI infrastructure. You will work on DGX Cloud production systems that enable large, scalable GPU clusters for a variety of AI workloads. The role focuses on GPU resource scheduling on Kubernetes, cluster operations, operator development, node health monitoring and telemetry, and the reliability and performance of production AI clusters.

Responsibilities

  • Build and maintain production systems that enable large, scalable GPU clusters for AI workloads.
  • Implement GPU resource scheduling capabilities on Kubernetes and develop related custom software.
  • Implement monitoring and health management capabilities that support the reliability, availability, and scalability of GPU assets, drawing on multiple data streams that range from GPU hardware diagnostics to cluster and network telemetry.
  • Collaborate with teams across NVIDIA to ensure production AI clusters run reliably and with high performance.
  • Evaluate system failures and improve services using a defined incident management process.

Requirements

  • Significant software engineering experience (posting asks for 15+ years in a similar role) with demonstrable impact in a highly technical organization.
  • Hands-on software development experience with Kubernetes APIs and frameworks (beyond cluster operation), including operator development and cluster management.
  • Experience with cluster operations, node health monitoring, and GPU resource scheduling.
  • Strong technical knowledge in at least one systems programming language (Go, Python) and a solid understanding of data structures and algorithms.
  • BS in Computer Science, Engineering, Physics, Mathematics or a comparable degree, or equivalent experience.
  • Strong communication skills and the ability to work effectively across cross-functional teams and geographies.

Ways to Stand Out

  • Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Bright Cluster Manager).
  • Proven operational excellence maintaining reliable and performant AI infrastructure.
  • Experience managing and automating large-scale distributed systems independent of cloud providers.

Compensation and Benefits

  • Base salary range: 248,000 USD - 396,750 USD (determined by location, experience, and pay of employees in similar positions).
  • Eligible for equity and NVIDIA benefits (link provided in original posting).

Additional Information

  • Applications accepted at least until May 17, 2026.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer and emphasizes diversity and non-discrimination in hiring and promotion practices.