Senior Site Reliability Storage Engineer - GPU Clusters
Required Skills & Competences
Ansible @ 6, Docker @ 6, Go @ 4, Kubernetes @ 6, Ruby @ 4, IaC @ 6, Terraform @ 6, Python @ 4, GCP @ 4, CI/CD @ 6, Algorithms @ 3, Machine Learning @ 3, AWS @ 4, Azure @ 4, Communication @ 7, Networking @ 4, GPU @ 4
Details
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years, pioneering innovation fueled by great technology and amazing people. Today, NVIDIA is defining the next era of computing powered by AI, where GPUs act as the brains of computers, robots, and self-driving cars that understand the world. Working at NVIDIA means immersion in a diverse, supportive environment inspiring employees to do their best work and make a lasting impact.
Responsibilities
- Research and implement distributed storage services.
- Design and implement scalable, efficient storage solutions tailored for data-intensive AI applications with optimized performance and cost-effectiveness.
- Continuously improve storage infrastructure provisioning, management, observability, and day-to-day operations through automation.
- Ensure the highest levels of uptime and quality of service (QoS) via operational excellence, proactive monitoring, and incident resolution.
- Support globally distributed on-premises and cloud environments such as AWS, GCP, Azure, or OCI.
- Define and implement service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure infrastructure quality.
- Write high-quality Root Cause Analysis (RCA) reports for production incidents and work to prevent future occurrences.
- Support researchers running flows on clusters, including performance analysis and optimization of deep learning workflows.
- Participate in the on-call rotation for critical infrastructure support.
- Drive evaluation and integration of storage solutions with new GPU technologies (e.g., GB200) and cloud technologies to improve system performance.
Requirements
- Minimum BS degree in Computer Science or equivalent experience.
- 6+ years managing high-speed storage solutions for GPU clusters or similar high-performance computing environments.
- Expertise in designing, deploying, and running production-level cloud services.
- Experience with parallel or distributed filesystems such as Lustre or GPFS, including performance tuning for AI/HPC workloads.
- Experience in architecture, design, and operation of storage solutions on leading cloud environments (AWS, Azure, GCP).
- Proficiency with orchestration and containerization tools like Kubernetes, Docker, or similar.
- Coding/scripting experience in at least two high-level languages (e.g., Python, Go, Ruby).
- Proficient with modern CI/CD practices and Infrastructure as Code (IaC) tools such as Terraform or Ansible.
- Strong communication and documentation skills.
Ways to Stand Out
- Experience running large-scale Slurm/LSF and/or BCM deployments in production.
- Expertise in modern container networking and storage architecture.
- Familiarity with Machine Learning and Deep Learning concepts, algorithms, and models.
- Proven record of driving operational excellence in distributed, high-performance environments.
Benefits
NVIDIA offers competitive salaries, comprehensive benefits, equity eligibility, and a diverse and inclusive work environment. The company is rapidly expanding its engineering teams due to exceptional growth and values hardworking, independent engineers passionate about technology.
Applications accepted until August 3, 2025.