Principal AI and ML Infra Software Engineer, GPU Clusters
at NVIDIA
Santa Clara, United States
USD 272,000-425,500 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Docker @ 4 Go @ 3 Kubernetes @ 4 DevOps @ 4 Python @ 3 GCP @ 3 AWS @ 3 Azure @ 3 Bash @ 3 Communication @ 7 Networking @ 4 PyTorch @ 4 GPU @ 4
Details
We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters to join NVIDIA's Hardware Infrastructure team. You will play a pivotal role in improving researcher efficiency across the entire stack by identifying infrastructure gaps and delivering scalable, high-performance solutions that enable cutting-edge AI/ML research on GPU clusters.
Responsibilities
- Engage closely with AI and ML research teams to understand infrastructure requirements and blockers, and convert those insights into actionable improvements.
- Proactively identify researcher efficiency bottlenecks and lead initiatives to systematically improve them; drive direction and long-term roadmaps for such initiatives.
- Monitor and optimize infrastructure performance to ensure high availability, scalability, and efficient resource utilization (see the monitoring sketch after this list).
- Define and refine key metrics of AI researcher efficiency, and ensure actions align with measurable results.
- Collaborate with researchers, data engineers, and DevOps professionals to develop a cohesive AI/ML infrastructure ecosystem.
- Stay current with developments in AI/ML technologies, frameworks, and operational strategies, and advocate for their integration.
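In practice, much of the monitoring work described above starts with sampling GPU health and utilization across cluster nodes. The following is a minimal, hypothetical sketch using the NVML Python bindings (pynvml); it is not an NVIDIA-internal tool, and the 50% idle threshold and single-shot sampling are arbitrary assumptions for illustration.

```python
# Minimal sketch: sample per-GPU utilization on one node via the NVML
# Python bindings (pynvml). Hypothetical illustration only; the 50%
# idle threshold and single-shot sampling are arbitrary assumptions.
import pynvml

def sample_gpu_utilization():
    pynvml.nvmlInit()
    try:
        stats = []
        for idx in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            stats.append({
                "gpu": idx,
                "sm_util_pct": util.gpu,           # SM utilization in percent
                "mem_used_gib": mem.used / 2**30,  # device memory in use
            })
        return stats
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    for s in sample_gpu_utilization():
        flag = " <- underutilized?" if s["sm_util_pct"] < 50 else ""
        print(f"GPU {s['gpu']}: {s['sm_util_pct']}% SM, "
              f"{s['mem_used_gib']:.1f} GiB used{flag}")
```

At fleet scale, samples like these would typically be exported to a cluster-wide metrics system rather than printed, but the per-device NVML queries are the common starting point.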
Requirements
- BS in Computer Science or related field (or equivalent experience).
- 15+ years of demonstrated expertise with AI/ML and High Performance Computing (HPC) workloads and systems.
- Hands-on experience operating HPC-grade infrastructure and in-depth knowledge of accelerated computing (GPU, custom silicon).
- Experience with storage systems such as Lustre, GPFS, BeeGFS.
- Experience with scheduling & orchestration systems (Slurm, Kubernetes, LSF).
- Knowledge of high-speed networking technologies (InfiniBand, RoCE, Amazon EFA).
- Experience with container technologies (Docker, Enroot).
- Capability in supervising and improving large-scale distributed training using frameworks such as PyTorch (DDP, FSDP), NeMo, or JAX (see the distributed-training sketch after this list).
- Deep understanding of AI/ML workflows including data processing, model training, and inference pipelines.
- Proficiency in programming and scripting (Python, Go, Bash) and familiarity with cloud platforms (AWS, GCP, Azure).
- Experience with parallel computing frameworks and paradigms.
- Strong communication and collaboration skills and a dedication to continuous learning.
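To ground the distributed-training requirement above, here is a minimal, hypothetical PyTorch DDP sketch. It assumes launch with a tool such as torchrun, which sets the RANK, WORLD_SIZE, and LOCAL_RANK environment variables; the toy model, dataset, and hyperparameters are placeholders, not anything specified in the posting.

```python
# Minimal sketch of multi-GPU data-parallel training with PyTorch DDP.
# Assumes launch via `torchrun --nproc_per_node=<gpus> train.py`, which
# sets RANK, WORLD_SIZE, and LOCAL_RANK. Model and data are toy placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # NCCL for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # syncs gradients across ranks
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):                      # toy training loop
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                          # all-reduce happens here
        opt.step()
        if dist.get_rank() == 0 and step % 20 == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On a Slurm-managed cluster of the kind described in this role, a job of this shape would typically be submitted with sbatch and launched per node via srun and torchrun; the exact launcher and container integration (e.g. Enroot) vary by site and are not specified here.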
Benefits & Additional Information
- NVIDIA offers competitive salaries and a comprehensive benefits package, and eligible candidates may receive equity.
- Base salary range: 272,000 USD - 425,500 USD (final base salary determined by location, experience, and the pay of employees in similar positions).
- Applications accepted at least until August 30, 2025.
- NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.