Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

at NVIDIA

📍 Austin, United States

USD 144,000-270,250 per year

SENIOR

✅ On-site

Required Skills & Competences

Software Development @ 4 Go @ 4 Kubernetes @ 4 Python @ 4 Algorithms @ 4 Data Structures @ 4 Distributed Systems @ 4 Hiring @ 4 Communication @ 7 Mathematics @ 4 API @ 4 GPU @ 4

Details

NVIDIA is hiring experienced software engineers with Kubernetes experience to help scale up its AI infrastructure. You will work on production systems that enable large-scale GPU clusters for a variety of AI workloads, focusing on scheduling GPU resources on Kubernetes, monitoring and health management, and improving the reliability and performance of production AI clusters.

Responsibilities

Build and maintain production systems for large-scale GPU clusters used for AI workloads.
Develop custom software related to scheduling GPU resources on Kubernetes.
Implement monitoring and health management capabilities to enable industry-leading reliability, availability, and scalability of GPU assets.
Harness multiple data streams, including GPU hardware diagnostics, cluster telemetry, and network telemetry.
Work across NVIDIA teams to ensure production AI clusters run reliably and consistently with maximum performance.
Evaluate system failures and improve services following a defined incident management process.

Requirements

5+ years in a similar software engineering role working on large-scale production systems.
Direct experience with software development using Kubernetes APIs and frameworks (not just operating a cluster).
Experience in cluster operations, operator development, and node health monitoring.
Experience with GPU resource scheduling and managing GPU-based infrastructure.
Technical knowledge of a systems programming language (Go, Python) and a solid understanding of data structures and algorithms.
BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree — or equivalent experience.
Strong communication skills and ability to work with cross-functional teams, principals, and architects across geographies.

Ways to Stand Out

Technical competency in managing and automating large-scale distributed systems independent of cloud providers.
Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Bright Cluster Manager).
Proven operational excellence in maintaining reliable and performant AI infrastructure.

Compensation & Benefits

Base salary range:
- Level 3: 144,000 USD - 230,000 USD per year
- Level 4: 168,000 USD - 270,250 USD per year
Eligible for equity and company benefits (see NVIDIA benefits page).

Additional Information

Location: Austin, TX, United States.
Employment type: Full time.
Applications accepted at least until August 25, 2025.
NVIDIA is an equal opportunity employer and values workplace diversity.