Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

at NVIDIA
USD 144,000-270,250 per year
SENIOR
✅ On-site

Required Skills & Competences

Software Development @ 4, Go @ 4, Kubernetes @ 4, Python @ 4, Algorithms @ 4, Data Structures @ 4, Distributed Systems @ 4, Hiring @ 4, Communication @ 7, Mathematics @ 4, API @ 4, GPU @ 4

Details

NVIDIA is hiring experienced software engineers with a Kubernetes background to help scale up its AI infrastructure. You will work on production systems that enable large, scalable GPU clusters for a variety of AI workloads, focusing on scheduling GPU resources on Kubernetes, monitoring and health management, and improving the reliability and performance of production AI clusters.

Responsibilities

  • Build and maintain production systems for large, scalable GPU clusters used for AI workloads.
  • Develop custom software related to scheduling GPU resources on Kubernetes.
  • Implement monitoring and health management capabilities to enable industry-leading reliability, availability, and scalability of GPU assets.
  • Harness multiple data streams, including GPU hardware diagnostics, cluster telemetry, and network telemetry.
  • Work across NVIDIA teams to ensure production AI clusters run reliably and consistently with maximum performance.
  • Evaluate system failures and improve services following a defined incident management process.

Requirements

  • 5+ years in a similar software engineering role working on large-scale production systems.
  • Direct experience with software development using Kubernetes APIs and frameworks (not just operating a cluster).
  • Experience in cluster operations, operator development, and node health monitoring.
  • Experience with GPU resource scheduling and managing GPU-based infrastructure.
  • Technical knowledge of a programming language used for systems work (Go, Python) and a solid understanding of data structures and algorithms.
  • BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree, or equivalent experience.
  • Strong communication skills and ability to work with cross-functional teams, principals, and architects across geographies.

Ways to Stand Out

  • Technical competency in managing and automating large-scale distributed systems independent of cloud providers.
  • Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Bright Cluster Manager).
  • Proven operational excellence in maintaining reliable and performant AI infrastructure.

Compensation & Benefits

  • Base salary range:
    • Level 3: 144,000 USD - 230,000 USD per year
    • Level 4: 168,000 USD - 270,250 USD per year
  • Eligible for equity and company benefits (see NVIDIA benefits page).

Additional Information

  • Location: Austin, TX, United States.
  • Employment type: Full time.
  • Applications accepted at least until August 25, 2025.
  • NVIDIA is an equal opportunity employer and values workplace diversity.