AI and ML Infra Software Engineer, GPU Clusters - New College Grad 2026

at NVIDIA

📍 Santa Clara, United States

USD 120,000-235,750 per year

JUNIOR MIDDLE

✅ On-site

Required Skills & Competences

Docker @ 3, Go @ 5, Kubernetes @ 3, DevOps @ 3, Python @ 5, GCP @ 2, Hiring @ 3, AWS @ 2, Azure @ 2, Bash @ 5, Communication @ 6, Networking @ 3, PyTorch @ 6, GPU @ 3

Details

NVIDIA is hiring an AI/ML Infrastructure Software Engineer to join the Hardware Infrastructure team. The role focuses on improving researcher productivity by implementing infrastructure advancements across the stack for GPU clusters to enable AI/ML research at scale. You will work closely with customers (researchers) to identify and resolve infrastructure gaps and build efficient, scalable solutions for accelerated computing and large-scale distributed training.

Responsibilities

Collaborate closely with AI and ML research teams to understand infrastructure needs and translate them into actionable improvements.
Monitor and optimize infrastructure performance to ensure high availability, scalability, and efficient resource utilization.
Define and improve measures of AI researcher efficiency and ensure actions align with measurable results.
Work with researchers, data engineers, and DevOps professionals to build a coordinated AI/ML infrastructure ecosystem.
Stay current with advancements in AI/ML technologies, frameworks, and strategies and promote their implementation.

Requirements

BS (or equivalent experience) in Computer Science or a related field.
Proven experience with AI/ML and HPC workloads and infrastructure.
Hands-on experience operating HPC-grade infrastructure and in-depth knowledge of accelerated computing (GPU/custom silicon), storage systems (Lustre, GPFS, BeeGFS), job scheduling & orchestration (Slurm, Kubernetes, LSF), high-speed networking (InfiniBand, RoCE, Amazon EFA), and container technologies (Docker, Enroot).
Expertise in running and optimizing large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX; deep understanding of AI/ML workflows including data processing, model training, and inference pipelines.
Proficiency in programming and scripting languages such as Python, Go, and Bash.
Familiarity with cloud platforms (AWS, GCP, Azure) and parallel computing frameworks/paradigms.
Strong communication and collaboration skills and a passion for continual learning in AI/ML infrastructure.

Benefits and Compensation

Base salary range (depends on location and level):
- Level 2: 120,000 USD - 189,750 USD
- Level 3: 148,000 USD - 235,750 USD
Eligible for equity and comprehensive benefits. See NVIDIA benefits for details.
Full-time position based in Santa Clara, CA, United States. Applications accepted until October 17, 2025.

Additional Information

NVIDIA emphasizes diversity and is an equal opportunity employer. The base salary will be determined by location, experience, and internal pay equity.