Senior Site Reliability Engineer - AI Research Clusters

at Nvidia

📍 Santa Clara, United States

$180,000-339,200 per year

SENIOR
✅ Hybrid



Required Skills & Competences

Ansible, Docker, Kubernetes, MySQL, Terraform, Python, Distributed Systems, Machine Learning, Leadership, Bash, Networking, Debugging

Details

NVIDIA is the leader in AI, machine learning, and datacenter acceleration, and is now extending that leadership into datacenter networking with Ethernet switches, NICs, and DPUs. The company has continually reinvented itself over two decades: our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing. This is our life's work: to amplify human imagination and intelligence. Make the choice and join our diverse team today!

Responsibilities

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. You will be responsible for:

  • Building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions.
  • Building and maintaining deep learning AI/HPC GPU clusters at scale and supporting researchers in running their workflows on them, including performance analysis and optimization of deep learning workloads.
  • Designing, implementing, and supporting operational and reliability aspects of large-scale distributed systems with a focus on performance at scale, real-time monitoring, logging, and alerting.
  • Optimizing cluster operations for maximum reliability, efficiency, and performance.
  • Troubleshooting, diagnosing, and performing root-cause analysis of system failures, isolating the affected components and failure scenarios while collaborating with internal and external partners.
  • Scaling systems sustainably through mechanisms like automation and evolving systems by proposing changes that improve reliability and velocity.
  • Practicing sustainable incident response and blameless postmortems.
  • Being part of an on-call rotation to support production systems.
  • Writing and reviewing code, developing documentation and capacity plans, and debugging complex problems on some of the largest and most complex systems in the world.

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field (or equivalent experience), with a minimum of 6 years of experience designing and operating large-scale compute infrastructure.
  • Proven experience in site reliability engineering for high-performance computing environments, including operational experience with clusters of at least 5,000 GPUs.
  • Deep understanding of GPU computing and AI infrastructure.
  • Passion for solving complex technical challenges and optimizing system performance.
  • Experience with advanced AI/HPC job schedulers; familiarity with schedulers such as Slurm is ideal.
  • Solid experience with GPU clusters and working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure-level applications, such as Kubernetes, Terraform, MySQL, etc.
  • In-depth understanding of container technologies like Docker and Enroot.
  • Experience programming in Python and scripting in Bash.

Benefits

NVIDIA offers highly competitive salaries and a comprehensive benefits package. The base salary range is 180,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits.