Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

at NVIDIA
USD 152,000-241,500 per year
SENIOR
✅ Remote

Required Skills & Competences

Go @ 4, Kubernetes @ 7, Python @ 4, Algorithms @ 4, Data Structures @ 4, Distributed Systems @ 4, Hiring @ 4, Communication @ 7, Mathematics @ 4, GPU @ 4, AI @ 4, Slurm @ 7

Details

NVIDIA is hiring experienced software engineers to help scale up its AI Infrastructure. You will work on production systems that enable large, scalable GPU clusters for a variety of AI workloads. The team expects significant software engineering experience with cluster operations, operator development, node health monitoring and GPU resource scheduling. Candidates should be creative, passionate about GPUs, and enjoy working in a challenging, fast-evolving environment.

Responsibilities

  • Be part of a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters to be used for a variety of AI workloads.
  • Design and develop a massively distributed, scalable platform used to identify, diagnose, and remediate non-performant GPU assets.
  • Work with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance.
  • Evaluate system failures and improve services based on a well-defined incident management process.

Requirements

  • Direct experience in a software engineering role within a highly technical organization with demonstrable impact from your work.
  • High motivation and strong communication skills, with the ability to work successfully with cross-functional teams, principals, and architects, and to coordinate effectively across organizational boundaries and geographies.
  • 5+ years in a similar role, including experience with large-scale production systems.
  • BS in Computer Science, Engineering, Physics, Mathematics or a comparable degree, or equivalent experience.
  • Proficiency in a systems programming language (Go, Python) and a solid understanding of data structures and algorithms.

Ways to stand out from the crowd (Preferred / Nice-to-have)

  • Technical competency in managing and automating large-scale distributed systems independent of cloud providers.
  • Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Base Command Manager).
  • Prior experience with asynchronous workflows and/or event-driven architecture.
  • Proven operational excellence in maintaining reliable and performant infrastructure.

Compensation & Benefits

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD.

You will also be eligible for equity and benefits. Applications for this job will be accepted at least until May 14, 2026.

Additional information

  • This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. The company does not discriminate on the basis of characteristics protected by law.