Principal AI and ML Infra Software Engineer, GPU Clusters

at NVIDIA
USD 272,000-425,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Docker (4), Go (3), Kubernetes (4), DevOps (4), Python (3), GCP (3), AWS (3), Azure (3), Bash (3), Communication (4), Networking (4), PyTorch (4), GPU (4)

Details

We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters, to join NVIDIA's Hardware Infrastructure team. In this role, you will play a pivotal part in making our researchers more efficient by driving improvements across the entire stack. Your core work will be collaborating closely with customers to identify and close infrastructure gaps, enabling groundbreaking AI and ML research on GPU clusters. Together, we can build powerful, efficient, and scalable solutions as we shape the future of AI/ML technology!

Responsibilities

  • Partner closely with AI and ML research teams to understand infrastructure requirements and pain points, and translate those insights into actionable improvements.
  • Proactively identify researcher efficiency bottlenecks and lead initiatives to systematically improve them; drive direction and long-term roadmaps for such initiatives.
  • Monitor and optimize infrastructure performance to ensure high availability, scalability, and efficient resource utilization (a minimal monitoring sketch follows this list).
  • Help define and improve measures of AI researcher efficiency, ensuring that actions are aligned with measurable results.
  • Work closely with researchers, data engineers, and DevOps professionals to develop a cohesive AI/ML infrastructure ecosystem.
  • Keep up to date with recent developments in AI/ML technologies, frameworks, and successful strategies, and advocate for their integration within the organization.
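
To make the resource-utilization work above concrete, here is a minimal monitoring sketch in Python using the NVML bindings (the pynvml package). It only polls per-GPU compute and memory utilization; the 80% threshold and 10-second interval are illustrative assumptions, not requirements of the role.

```python
# Minimal sketch: poll per-GPU utilization via NVML (pynvml).
# The 80% threshold and 10 s interval are illustrative assumptions.
import time

import pynvml

def report_gpu_utilization(threshold: int = 80) -> None:
    pynvml.nvmlInit()
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            note = "" if util.gpu >= threshold else "  <- possibly underutilized"
            print(f"GPU {index}: compute {util.gpu}%, memory {util.memory}%{note}")
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    while True:
        report_gpu_utilization()
        time.sleep(10)  # polling interval (assumption)
```

In practice a signal like this would feed a metrics pipeline (for example, DCGM exporters into a dashboard) rather than print statements; the sketch only illustrates the underlying API.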

Requirements

  • BS in Computer Science or a related field, or equivalent experience.
  • 15+ years of demonstrated experience with AI/ML and HPC workloads and systems.
  • Hands-on experience operating High Performance Computing (HPC) grade infrastructure and in-depth knowledge of accelerated computing (GPU, custom silicon), storage systems (Lustre, GPFS, BeeGFS), scheduling & orchestration (Slurm, Kubernetes, LSF), high-speed networking (InfiniBand, RoCE, Amazon EFA), and container technologies (Docker, Enroot).
  • Experience operating and optimizing large-scale distributed training with PyTorch (DDP, FSDP), NeMo, or JAX, and an in-depth understanding of AI/ML workflows including data processing, model training, and inference pipelines (a minimal DDP sketch follows this list).
  • Proficiency in programming and scripting languages such as Python, Go, and Bash, plus familiarity with cloud platforms (AWS, GCP, Azure) and parallel computing frameworks and paradigms.
  • Dedication to continuous learning and staying updated on new technologies and methods in AI/ML infrastructure.
  • Excellent communication and collaboration skills, with the ability to work effectively with teams and individuals of different backgrounds.
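
As a concrete reference for the distributed-training experience above, here is a minimal sketch of a PyTorch DDP training loop, assuming it is launched with torchrun (for example, from a Slurm batch script). The model, batch shape, and hyperparameters are hypothetical placeholders.

```python
# Minimal sketch: single-file PyTorch DDP training loop.
# Assumes launch via `torchrun --nproc-per-node=<gpus> train.py`,
# which sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main() -> None:
    dist.init_process_group(backend="nccl")  # NCCL for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and optimizer; a real workload would build a
    # transformer, a dataset, and a sharded data loader here.
    model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank),
                device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):  # placeholder training loop
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

An FSDP variant would swap the DDP wrapper for torch.distributed.fsdp.FullyShardedDataParallel to shard parameters across ranks; the launch and process-group setup stay the same.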

Benefits

  • Competitive base salary (range listed in the Compensation section below), determined by location, experience, and the pay of employees in similar positions.
  • Eligibility for equity and a comprehensive benefits package. See NVIDIA benefits for details.
  • Applications accepted at least until August 30, 2025.

Compensation

  • Base salary range: 272,000 USD - 425,500 USD (location- and experience-dependent).