Senior ML Storage Engineer - GPU Clusters
at Nvidia
📍 Santa Clara, United States
USD 184,000-356,500 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Ansible @ 6, Docker @ 6, Go @ 4, Kubernetes @ 6, Ruby @ 4, IaC @ 6, Terraform @ 6, Python @ 4, GCP @ 4, CI/CD @ 6, Algorithms @ 4, Machine Learning @ 4, AWS @ 4, Azure @ 4, Communication @ 7, Networking @ 4, GPU @ 4
Details
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today NVIDIA is tapping into the unlimited potential of AI to define the next era of computing. As a member of the team you will work on designing, deploying, and managing high-speed storage offerings in large-scale GPU clusters that power AI workloads across multiple teams and projects.
Responsibilities
- Research and implement distributed storage services.
- Design and implement scalable and efficient storage solutions tailored for data‑intensive AI applications, optimizing performance and cost‑effectiveness.
- Continuously improve storage infrastructure provisioning, management, observability, and day‑to‑day operations through automation.
- Ensure high uptime and quality of service (QoS) through operational excellence, proactive monitoring, and incident resolution.
- Support globally distributed on‑premises and cloud environments such as AWS, GCP, Azure, or OCI.
- Define and implement service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure infrastructure quality.
- Produce high‑quality Root Cause Analysis (RCA) reports for production incidents and work to prevent recurrences.
- Support researchers running workflows on the clusters, including performance analysis and optimization of deep learning workloads.
- Participate in the team’s on‑call rotation to support critical infrastructure.
- Drive evaluation and integration of storage solutions with new GPU technologies (e.g., GB200) and cloud technologies to improve system performance.
Requirements
- Minimum BS degree in Computer Science (or equivalent experience) with 6+ years managing high‑speed storage solutions deployed for GPU clusters or similar high‑performance computing environments.
- Expertise in designing, deploying, and running production‑level cloud services.
- Experience with one or more parallel or distributed filesystems such as Lustre or GPFS, including analyzing and tuning performance for AI/HPC workloads.
- Experience in architecture design and operation of storage solutions on leading cloud environments (AWS, Azure, or GCP).
- Proficiency with orchestration and containerization tools like Kubernetes, Docker, or similar.
- Experience coding/scripting in at least two high‑level programming languages (examples: Python, Go, Ruby).
- Proficient in modern CI/CD techniques and Infrastructure as Code (IaC) using tools such as Terraform or Ansible.
- Strong communication and documentation skills; diligence and focus on operational excellence.
Ways to stand out
- Experience running large‑scale Slurm, LSF, and/or BCM deployments in production.
- Expertise in modern container networking and storage architecture.
- Experience with machine learning and deep learning concepts, algorithms, and models.
- Track record of defining and driving operational excellence in highly distributed, high‑performance environments.
Compensation & Benefits
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.
Additional information
- Applications for this job will be accepted at least until August 5, 2025.
- NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. NVIDIA does not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.