Senior AI and ML Storage Infra Software Engineer, GPU Clusters
at NVIDIA
USD 184,000-356,500 per year
Required Skills & Competences
Docker @ 4, Go @ 6, Kubernetes @ 3, DevOps @ 4, Python @ 6, GCP @ 3, Hiring @ 4, AWS @ 3, Azure @ 3, Bash @ 6, Communication @ 7, Networking @ 4, Performance Optimization @ 4, PyTorch @ 4, GPU @ 4
Details
NVIDIA is hiring an AI/ML Storage Infrastructure Software Engineer to join the Capability Systems team. You will boost researcher productivity by implementing improvements across storage infrastructure tooling and operational practices for GPU clusters, working closely with AI/ML researchers to identify and resolve gaps so innovative AI/ML research can run efficiently and at scale.
Responsibilities
- Collaborate closely with AI and ML research teams to understand storage infrastructure and tooling needs and translate those into actionable improvements.
- Monitor and optimize infrastructure performance to ensure high availability, scalability, and efficient resource utilization.
- Define and refine storage-focused measures of AI researcher efficiency so that improvements align with measurable results.
- Work with researchers, data engineers, and DevOps professionals to build a coordinated AI/ML infrastructure ecosystem.
- Stay current with advancements in AI/ML technologies, frameworks, and strategies and promote their implementation.
Requirements
- BS or equivalent experience in Computer Science or related field, with 6+ years of experience in AI/ML and HPC workloads and infrastructure.
- Hands-on experience operating HPC-grade infrastructure and deep knowledge of accelerated computing (GPU, custom silicon).
- Experience with storage systems such as Lustre, GPFS, BeeGFS.
- Familiarity with scheduling & orchestration systems: Slurm, Kubernetes, LSF.
- Knowledge of high-speed networking: InfiniBand, RoCE, Amazon EFA.
- Experience with container technologies: Docker, Enroot.
- Expertise running and optimizing large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX.
- Deep understanding of AI/ML workflows including data processing, model training, and inference pipelines.
- Proficiency in programming and scripting languages such as Python, Go, Bash.
- Familiarity with cloud platforms (AWS, GCP, Azure, OCI) and parallel computing frameworks and paradigms.
- Strong communication and collaboration skills and a passion for continual learning in AI/ML infrastructure.
Benefits
- Competitive base salary (see ranges below) and eligibility for equity and comprehensive benefits.
- Opportunity to work on large-scale AI/ML infrastructure at NVIDIA with exposure to researchers and engineering teams.
Compensation & Logistics
- Base salary ranges by level:
- Level 4: 184,000 USD - 287,500 USD
- Level 5: 224,000 USD - 356,500 USD
- Location: Santa Clara, CA, United States
- Employment type: Full time
- Applications accepted at least until August 8, 2025
Technologies / Keywords Mentioned
Lustre, GPFS, BeeGFS, Slurm, Kubernetes, LSF, InfiniBand, RoCE, Amazon EFA, Docker, Enroot, PyTorch (DDP, FSDP), NeMo, JAX, Python, Go, Bash, AWS, GCP, Azure, OCI, HPC, GPU, accelerated computing, distributed training, parallel computing, storage performance optimization, AI/ML workflows