Senior Software Engineer, Distributed Systems Engineer - DGX Cloud
at NVIDIA
USD 144,000-270,250 per year
Required Skills & Competences
Software Development, Go, Kubernetes, Python, Algorithms, Data Structures, Distributed Systems, Hiring, Communication, Mathematics, API, GPU
Details
NVIDIA is hiring experienced software engineers with Kubernetes experience to help scale up its AI infrastructure. You will work on production systems that enable large, scalable GPU clusters for a variety of AI workloads, focusing on scheduling GPU resources on Kubernetes, monitoring and health management, and improving the reliability and performance of production AI clusters.
Responsibilities
- Build and maintain production systems for large, scalable GPU clusters used for AI workloads.
- Develop custom software related to scheduling GPU resources on Kubernetes.
- Implement monitoring and health management capabilities to enable industry-leading reliability, availability, and scalability of GPU assets.
- Harness multiple data streams, including GPU hardware diagnostics, cluster telemetry, and network telemetry.
- Work across NVIDIA teams to ensure production AI clusters run reliably and consistently with maximum performance.
- Evaluate system failures and improve services following a defined incident management process.
Requirements
- 5+ years in a similar software engineering role working on large-scale production systems.
- Direct experience with software development using Kubernetes APIs and frameworks (not just operating a cluster).
- Experience in cluster operations, operator development, and node health monitoring.
- Experience with GPU resource scheduling and managing GPU-based infrastructure.
- Technical proficiency in a programming language such as Go or Python, and a solid understanding of data structures and algorithms.
- BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree, or equivalent experience.
- Strong communication skills and ability to work with cross-functional teams, principals, and architects across geographies.
Ways to Stand Out
- Technical competency in managing and automating large-scale distributed systems independent of cloud providers.
- Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Bright Cluster Manager).
- Proven operational excellence in maintaining reliable and performant AI infrastructure.
Compensation & Benefits
- Base salary range:
  - Level 3: 144,000 USD - 230,000 USD per year
  - Level 4: 168,000 USD - 270,250 USD per year
- Eligible for equity and company benefits (see NVIDIA benefits page).
Additional Information
- Location: Austin, TX, United States.
- Employment type: Full time.
- Applications accepted at least until August 25, 2025.
- NVIDIA is an equal opportunity employer and values workplace diversity.