Senior AI and ML Storage Infra Software Engineer, GPU Clusters
at NVIDIA
USD 184,000-356,500 per year
Required Skills & Competences
Docker @ 4, Go @ 6, Kubernetes @ 3, DevOps @ 4, Python @ 6, GCP @ 3, Hiring @ 4, AWS @ 3, Azure @ 3, Bash @ 6, Communication @ 7, Networking @ 4, Performance Optimization @ 4, PyTorch @ 4, GPU @ 4
Details
NVIDIA is hiring an AI/ML Storage Infrastructure Software Engineer to join the Capability Systems team. You will boost researcher productivity by implementing improvements across storage infrastructure tooling and operational practices for GPU clusters, working closely with AI/ML researchers to identify and resolve gaps so innovative AI/ML research can run efficiently and at scale.
Responsibilities
- Collaborate closely with AI and ML research teams to understand storage infrastructure and tooling needs and translate those into actionable improvements.
- Monitor and optimize infrastructure performance to ensure high availability, scalability, and efficient resource utilization.
- Define and refine storage-focused measures of AI researcher efficiency so that improvements align with measurable results.
- Work with researchers, data engineers, and DevOps professionals to build a coordinated AI/ML infrastructure ecosystem.
- Stay current with advancements in AI/ML technologies, frameworks, and strategies and promote their implementation.
Requirements
- BS or equivalent experience in Computer Science or related field, with 6+ years of experience in AI/ML and HPC workloads and infrastructure.
- Hands-on experience operating HPC-grade infrastructure and deep knowledge of accelerated computing (GPU, custom silicon).
- Experience with storage systems such as Lustre, GPFS, BeeGFS.
- Familiarity with scheduling & orchestration systems: Slurm, Kubernetes, LSF.
- Knowledge of high-speed networking: InfiniBand, RoCE, Amazon EFA.
- Experience with container technologies: Docker, Enroot.
- Expertise running and optimizing large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX.
- Deep understanding of AI/ML workflows including data processing, model training, and inference pipelines.
- Proficiency in programming and scripting languages such as Python, Go, Bash.
- Familiarity with cloud platforms (AWS, GCP, Azure, OCI) and parallel computing frameworks and paradigms.
- Strong communication and collaboration skills and a passion for continual learning in AI/ML infrastructure.
Benefits
- Competitive base salary (see ranges below) and eligibility for equity and comprehensive benefits.
- Opportunity to work on large-scale AI/ML infrastructure at NVIDIA with exposure to researchers and engineering teams.
Compensation & Logistics
- Base salary ranges by level:
- Level 4: 184,000 USD - 287,500 USD
- Level 5: 224,000 USD - 356,500 USD
- Location: Santa Clara, CA, United States
- Employment type: Full time
- Applications accepted at least until August 8, 2025
Technologies / Keywords Mentioned
Lustre, GPFS, BeeGFS, Slurm, Kubernetes, LSF, InfiniBand, RoCE, Amazon EFA, Docker, Enroot, PyTorch (DDP, FSDP), NeMo, JAX, Python, Go, Bash, AWS, GCP, Azure, OCI, HPC, GPU, accelerated computing, distributed training, parallel computing, storage performance optimization, AI/ML workflows