Senior Site Reliability Engineer - AI Research Clusters

at Nvidia

📍 Santa Clara, United States

$180,000-339,200 per year

SENIOR
✅ Hybrid



Required Skills & Competences

Ansible, Docker, Kubernetes, MySQL, Terraform, Python, Distributed Systems, Machine Learning, Leadership, Bash, Networking, Debugging

Details

NVIDIA is the leader in AI, machine learning, and datacenter acceleration, and is now extending that leadership into datacenter networking with Ethernet switches, NICs, and DPUs. The company has continually reinvented itself over two decades: our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing. This is our life's work: to amplify human imagination and intelligence. Make the choice and join our diverse team today!

Responsibilities

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. You will be responsible for:

  • Building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions.
  • Building and maintaining deep learning AI/HPC GPU clusters at scale and supporting researchers in running their workflows on them, including performance analysis and optimization of deep learning workloads.
  • Designing, implementing, and supporting operational and reliability aspects of large-scale distributed systems with a focus on performance at scale, real-time monitoring, logging, and alerting.
  • Optimizing cluster operations for maximum reliability, efficiency, and performance.
  • Troubleshooting, diagnosing, and performing root-cause analysis of system failures, isolating the affected components and failure scenarios while collaborating with internal and external partners.
  • Scaling systems sustainably through mechanisms like automation and evolving systems by proposing changes that improve reliability and velocity.
  • Practicing sustainable incident response and blameless postmortems.
  • Being part of an on-call rotation to support production systems.
  • Writing and reviewing code, developing documentation and capacity plans, and debugging complex problems on some of the largest and most complex systems in the world.

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field (or equivalent experience), with a minimum of 6 years of experience designing and operating large-scale compute infrastructure.
  • Proven experience in site reliability engineering for high-performance computing environments, including operational experience with clusters of at least 5,000 GPUs.
  • Deep understanding of GPU computing and AI infrastructure.
  • Passion for solving complex technical challenges and optimizing system performance.
  • Experience with advanced AI/HPC job schedulers; familiarity with schedulers such as Slurm is ideal.
  • Solid experience with GPU clusters and working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure-level applications, such as Kubernetes, Terraform, MySQL, etc.
  • In-depth understanding of container technologies like Docker and Enroot.
  • Experience programming in Python and scripting in Bash.

Benefits

NVIDIA offers highly competitive salaries and a comprehensive benefits package. The base salary range is 180,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits.