Principal AI and ML Infra Software Engineer, GPU Clusters

at Nvidia
USD 272,000-425,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

  • Docker @ 4
  • Go @ 3
  • Kubernetes @ 4
  • DevOps @ 4
  • Python @ 3
  • GCP @ 3
  • AWS @ 3
  • Azure @ 3
  • Bash @ 3
  • Communication @ 7
  • Networking @ 4
  • PyTorch @ 4
  • GPU @ 4

Details

We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters to join NVIDIA's Hardware Infrastructure team. You will play a pivotal role in improving researcher efficiency across the entire stack by identifying infrastructure gaps and delivering scalable, high-performance solutions that enable cutting-edge AI/ML research on GPU clusters.

Responsibilities

  • Engage closely with AI and ML research teams to understand infrastructure requirements and blockers, and convert those insights into actionable improvements.
  • Proactively identify researcher efficiency bottlenecks and lead initiatives to systematically improve them; drive direction and long-term roadmaps for such initiatives.
  • Monitor and optimize infrastructure performance to ensure high availability, scalability, and efficient resource utilization.
  • Define and refine key metrics of AI researcher efficiency, and ensure initiatives align with measurable results.
  • Collaborate with researchers, data engineers, and DevOps professionals to develop a cohesive AI/ML infrastructure ecosystem.
  • Stay current with developments in AI/ML technologies, frameworks, and operational strategies, and advocate for their integration.

Requirements

  • BS in Computer Science or related field (or equivalent experience).
  • 15+ years of demonstrated expertise in AI/ML and High Performance Computing (HPC) tasks and systems.
  • Hands-on experience operating HPC-grade infrastructure and in-depth knowledge of accelerated computing (GPU, custom silicon).
  • Experience with storage systems such as Lustre, GPFS, BeeGFS.
  • Experience with scheduling & orchestration systems (Slurm, Kubernetes, LSF).
  • Knowledge of high-speed networking technologies (InfiniBand, RoCE, Amazon EFA).
  • Experience with container technologies (Docker, Enroot).
  • Ability to supervise and optimize large-scale distributed training using frameworks such as PyTorch (DDP, FSDP), NeMo, or JAX.
  • Deep understanding of AI/ML workflows including data processing, model training, and inference pipelines.
  • Proficiency in programming and scripting (Python, Go, Bash) and familiarity with cloud platforms (AWS, GCP, Azure).
  • Experience with parallel computing frameworks and paradigms.
  • Strong communication and collaboration skills and a dedication to continuous learning.
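To illustrate the scheduling and training-framework requirements above (Slurm plus PyTorch DDP), here is a minimal sketch of a multi-node Slurm batch script; the job name, partition, time limit, and `train.py` script are hypothetical placeholders, not anything specified in this posting:

```shell
#!/bin/bash
#SBATCH --job-name=ddp-train        # hypothetical job name
#SBATCH --nodes=4                   # four nodes
#SBATCH --ntasks-per-node=8         # one task per GPU
#SBATCH --gpus-per-node=8
#SBATCH --partition=gpu             # hypothetical partition name
#SBATCH --time=04:00:00

# Rendezvous endpoint for torch.distributed: the first node in the allocation.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# srun launches one process per GPU across all nodes; the training script
# (hypothetical) would derive its rank and world size from the SLURM_* /
# MASTER_* environment, or torchrun could be used for the same effect.
srun python train.py
```

This is only a sketch under those assumptions; real clusters differ in partition layout, container integration (e.g. Enroot/Pyxis), and launcher choice.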

Benefits & Additional Information

  • NVIDIA offers competitive salaries and a comprehensive benefits package, and eligible candidates may receive equity.
  • Base salary range: 272,000 USD - 425,500 USD (final base salary determined by location, experience, and comparable employee pay).
  • Applications accepted at least until August 30, 2025.
  • NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.