Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

at NVIDIA

📍 United States

USD 152,000-241,500 per year

SENIOR

✅ Remote

Used Tools & Technologies

Not specified

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;

3-6: daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9: expert. You can teach others and you know all the pitfalls and tricks;

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Go @ 4, Kubernetes @ 7, Python @ 4, Algorithms @ 4, Data Structures @ 4, Distributed Systems @ 4, Hiring @ 4, Communication @ 7, Mathematics @ 4, GPU @ 4, AI @ 4, Slurm @ 7

Details

NVIDIA is hiring experienced software engineers to help scale up its AI Infrastructure. You will work on production systems that enable large, scalable GPU clusters for a variety of AI workloads. The team expects significant software engineering experience with cluster operations, operator development, node health monitoring and GPU resource scheduling. Candidates should be creative, passionate about GPUs, and enjoy working in a challenging, fast-evolving environment.

Responsibilities

Be part of a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters to be used for a variety of AI workloads.
Design and develop a massively distributed, scalable platform used to identify, diagnose, and remediate non-performant GPU assets.
Work with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance.
Evaluate system failures and improve services based on a well-defined incident management process.

Requirements

Direct experience in a software engineering role within a highly technical organization with demonstrable impact from your work.
Highly motivated, with strong communication skills and the ability to work successfully with cross-functional teams, principals, and architects, and to coordinate effectively across organizational boundaries and geographies.
5+ years in a similar role, with experience on large-scale production systems.
BS in Computer Science, Engineering, Physics, Mathematics or a comparable degree, or equivalent experience.
Technical knowledge including a systems programming language (Go, Python) and a solid understanding of data structures and algorithms.

Ways to stand out from the crowd (Preferred / Nice-to-have)

Technical competency in managing and automating large-scale distributed systems independent of cloud providers.
Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Base Command Manager).
Prior experience in asynchronous workflows and/or event-driven architecture.
Proven operational excellence in maintaining reliable and performant infrastructure.

Compensation & Benefits

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD.

You will also be eligible for equity and benefits. Applications for this job will be accepted at least until May 14, 2026.

Additional information

This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. The company does not discriminate on the basis of characteristics protected by law.