AI and ML Infra Software Engineer, GPU Clusters - New College Grad 2026
at NVIDIA
Santa Clara, United States
USD 120,000-235,800 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Docker @ 3, Go @ 5, Kubernetes @ 3, DevOps @ 3, Python @ 5, GCP @ 2, Hiring @ 3, AWS @ 2, Azure @ 2, Bash @ 5, Communication @ 6, Networking @ 3, PyTorch @ 6, GPU @ 3
Details
NVIDIA is hiring an AI/ML Infrastructure Software Engineer to join the Hardware Infrastructure team. The role focuses on improving researcher productivity by implementing infrastructure advancements across the stack for GPU clusters to enable AI/ML research at scale. You will work closely with customers (researchers) to identify and resolve infrastructure gaps and build efficient, scalable solutions for accelerated computing and large-scale distributed training.
Responsibilities
- Collaborate closely with AI and ML research teams to understand infrastructure needs and translate them into actionable improvements.
- Monitor and optimize infrastructure performance to ensure high availability, scalability, and efficient resource utilization.
- Define and improve measures of AI researcher efficiency and ensure actions align with measurable results.
- Work with researchers, data engineers, and DevOps professionals to build a coordinated AI/ML infrastructure ecosystem.
- Stay current with advancements in AI/ML technologies, frameworks, and strategies and promote their implementation.
Requirements
- BS (or equivalent experience) in Computer Science or a related field.
- Proven experience with AI/ML and HPC workloads and infrastructure.
- Hands-on experience operating HPC-grade infrastructure and in-depth knowledge of accelerated computing (GPU/custom silicon), storage systems (Lustre, GPFS, BeeGFS), job scheduling & orchestration (Slurm, Kubernetes, LSF), high-speed networking (InfiniBand, RoCE, Amazon EFA), and container technologies (Docker, Enroot).
- Expertise in running and optimizing large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX; deep understanding of AI/ML workflows including data processing, model training, and inference pipelines.
- Proficiency in programming and scripting languages such as Python, Go, and Bash.
- Familiarity with cloud platforms (AWS, GCP, Azure) and parallel computing frameworks/paradigms.
- Strong communication and collaboration skills and a passion for continual learning in AI/ML infrastructure.
Benefits and Compensation
- Base salary range (depends on location and level):
  - Level 2: 120,000 USD - 189,750 USD
  - Level 3: 148,000 USD - 235,750 USD
- Eligible for equity and comprehensive benefits. See NVIDIA benefits for details.
- Full-time position based in Santa Clara, CA, United States. Applications accepted until October 17, 2025.
Additional Information
- NVIDIA emphasizes diversity and is an equal opportunity employer. The base salary will be determined by location, experience, and internal pay equity.