Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

at NVIDIA
USD 144,000-270,250 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

  • Go @ 6
  • Kubernetes @ 4
  • Python @ 6
  • Algorithms @ 4
  • Data Structures @ 4
  • Distributed Systems @ 4
  • Hiring @ 4
  • Technical Proficiency @ 6
  • Communication @ 7
  • Mathematics @ 4
  • API @ 4
  • GPU @ 4

Details

NVIDIA is hiring experienced software engineers with Kubernetes experience to help scale up its AI infrastructure. The role involves significant software engineering on Kubernetes, including cluster operations, operator development, node health monitoring, and GPU resource scheduling. The position requires creativity, a passion for Kubernetes and GPUs, and a strong bias for execution.

Responsibilities

  • Work as part of the DGX Cloud team responsible for production systems that enable large, scalable GPU clusters for AI workloads.
  • Develop custom software related to scheduling GPU resources on Kubernetes.
  • Implement monitoring and health management capabilities to ensure leading reliability, availability, and scalability of GPU assets.
  • Use multiple data streams, including GPU hardware diagnostics and cluster/network telemetry, to inform monitoring and health management.
  • Collaborate with teams across NVIDIA to ensure reliable and consistent production AI cluster performance.
  • Evaluate system failures and improve services following incident management processes.

Requirements

  • Hands-on software engineering experience with demonstrable impact, especially building on Kubernetes APIs and frameworks rather than only operating clusters.
  • Strong communication skills to work successfully with cross-functional teams, principals, and architects.
  • 5+ years in a similar role with experience managing large-scale production systems.
  • Excellent knowledge of software engineering principles, tools, and techniques.
  • BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience.
  • Technical proficiency in programming languages such as Go and Python.
  • Solid understanding of data structures and algorithms.

Ways to Stand Out

  • Technical competency in managing and automating large-scale distributed systems, independent of cloud providers.
  • Advanced hands-on experience and deep knowledge of cluster management systems (Kubernetes, Slurm, Bright Cluster Manager).
  • Proven operational excellence in maintaining reliable and high-performance AI infrastructure.

Benefits

  • Competitive base salary ranging from 144,000 USD to 270,250 USD, depending on location, experience, and the pay of employees in similar roles.
  • Eligibility for equity and comprehensive benefits.
  • Equal opportunity employer committed to diversity and inclusion.