Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

at NVIDIA

📍 Santa Clara, United States

USD 144,000-270,250 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Software Development @ 4
Go @ 7
Kubernetes @ 4
Python @ 7
Algorithms @ 7
Data Structures @ 7
Distributed Systems @ 4
Hiring @ 4
Communication @ 7
Mathematics @ 4
API @ 4
GPU @ 4

Details

NVIDIA is hiring experienced software engineers with Kubernetes experience to help scale up its AI infrastructure. You will be part of the DGX Cloud team responsible for production systems that enable large, scalable GPU clusters to be used for a variety of AI workloads. The role focuses on custom software related to scheduling GPU resources on Kubernetes, implementing monitoring and health management, and working across teams to ensure reliable, high-performance production AI clusters.

Responsibilities

Work on production systems enabling large-scale GPU clusters for AI workloads, including custom software for scheduling GPU resources on Kubernetes.
Implement monitoring and health management capabilities to deliver industry-leading reliability, availability, and scalability of GPU assets, using multiple data streams from GPU hardware diagnostics to cluster and network telemetry.
Collaborate with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance.
Evaluate system failures and improve services through a well-defined incident management process.

Requirements

Significant software engineering experience (5+ years) in highly technical organizations with demonstrable impact from your work.
Software development experience with Kubernetes APIs and frameworks that goes beyond cluster operation alone, including operator development and node health monitoring as well as cluster operations.
Experience with GPU resource scheduling and operating large-scale production systems.
Strong technical knowledge of a systems programming language (such as Go or Python) and a solid understanding of data structures and algorithms.
BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree, or equivalent experience.
Highly motivated with strong communication skills; able to work successfully with cross-functional teams, principals, and architects across geographies.

Ways to stand out

Technical competency in managing and automating large-scale distributed systems independent of cloud providers.
Advanced hands-on experience and deep understanding of cluster management systems such as Kubernetes, Slurm, and Bright Cluster Manager.
Proven operational excellence in maintaining reliable and performant AI infrastructure.

Compensation & Benefits

Base salary ranges by level: Level 3: 144,000 USD - 230,000 USD; Level 4: 168,000 USD - 270,250 USD. Final base salary will be determined based on location, experience, and pay of employees in similar positions.
Eligible for equity and company benefits (see NVIDIA benefits).

Additional information

Applications for this job will be accepted at least until August 9, 2025.
NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.