Principal Software Engineer, Distributed Systems Engineer - DGX Cloud

at NVIDIA

📍 Durham, United States

USD 248,000-396,800 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. Proficiency levels:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and know the pitfalls and tricks.

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Software Development @ 4, Go @ 7, Kubernetes @ 4, Python @ 7, Algorithms @ 7, Data Structures @ 7, Distributed Systems @ 4, Hiring @ 4, Communication @ 7, Mathematics @ 4, API @ 4, GPU @ 4, AI @ 4, Slurm @ 7

Details

NVIDIA is hiring experienced software engineers with a Kubernetes background to help scale its AI infrastructure. You will work on DGX Cloud production systems that enable large, scalable GPU clusters for a variety of AI workloads. The role focuses on GPU resource scheduling on Kubernetes, cluster operations, operator development, node health monitoring and telemetry, and the reliability and performance of production AI clusters.

Responsibilities

Build and maintain production systems that enable large, scalable GPU clusters for AI workloads.
Implement GPU resource scheduling capabilities on Kubernetes and develop related custom software.
Implement monitoring and health management capabilities that support the reliability, availability, and scalability of GPU assets, drawing on multiple data streams that range from GPU hardware diagnostics to cluster and network telemetry.
Collaborate with teams across NVIDIA to ensure production AI clusters run reliably and perform well.
Evaluate system failures and improve services using a defined incident management process.

Requirements

Significant software engineering experience (posting asks for 15+ years in a similar role) with demonstrable impact in a highly technical organization.
Hands-on software development experience with Kubernetes APIs and frameworks (beyond cluster operation), including operator development and cluster management.
Experience with cluster operations, node health monitoring, and GPU resource scheduling.
Strong technical knowledge in at least one systems programming language (Go, Python) and a solid understanding of data structures and algorithms.
BS in Computer Science, Engineering, Physics, Mathematics or a comparable degree, or equivalent experience.
Strong communication skills and ability to work effectively across multi-functional teams and geographies.

Ways to Stand Out

Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Bright Cluster Manager).
Proven operational excellence maintaining reliable and performant AI infrastructure.
Experience managing and automating large-scale distributed systems independent of cloud providers.

Compensation and Benefits

Base salary range: 248,000 USD - 396,750 USD (determined by location, experience, and pay of employees in similar positions).
Eligible for equity and NVIDIA benefits (link provided in original posting).

Additional Information

Applications accepted at least until May 17, 2026.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer and emphasizes diversity and non-discrimination in hiring and promotion practices.