AI and ML Infra Software Engineer, GPU Clusters - New College Grad 2026

at NVIDIA
USD 120,000-235,750 per year
JUNIOR MIDDLE
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Docker @ 3, Go @ 5, Kubernetes @ 3, DevOps @ 3, Python @ 5, GCP @ 2, Hiring @ 3, AWS @ 2, Azure @ 2, Bash @ 5, Communication @ 6, Networking @ 3, PyTorch @ 6, GPU @ 3

Details

NVIDIA is hiring an AI/ML Infrastructure Software Engineer to join the Hardware Infrastructure team. The role focuses on improving researcher productivity by implementing infrastructure advancements across the stack for GPU clusters to enable AI/ML research at scale. You will work closely with customers (researchers) to identify and resolve infrastructure gaps and build efficient, scalable solutions for accelerated computing and large-scale distributed training.

Responsibilities

  • Collaborate closely with AI and ML research teams to understand infrastructure needs and translate them into actionable improvements.
  • Monitor and optimize infrastructure performance to ensure high availability, scalability, and efficient resource utilization.
  • Define and improve measures of AI researcher efficiency and ensure actions align with measurable results.
  • Work with researchers, data engineers, and DevOps professionals to build a coordinated AI/ML infrastructure ecosystem.
  • Stay current with advancements in AI/ML technologies, frameworks, and strategies, and promote their adoption.

Requirements

  • BS (or equivalent experience) in Computer Science or a related field.
  • Proven experience with AI/ML and HPC workloads and infrastructure.
  • Hands-on experience operating HPC-grade infrastructure and in-depth knowledge of accelerated computing (GPU/custom silicon), storage systems (Lustre, GPFS, BeeGFS), job scheduling & orchestration (Slurm, Kubernetes, LSF), high-speed networking (InfiniBand, RoCE, Amazon EFA), and container technologies (Docker, Enroot).
  • Expertise in running and optimizing large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX; deep understanding of AI/ML workflows including data processing, model training, and inference pipelines.
  • Proficiency in programming and scripting languages such as Python, Go, and Bash.
  • Familiarity with cloud platforms (AWS, GCP, Azure) and parallel computing frameworks/paradigms.
  • Strong communication and collaboration skills and a passion for continual learning in AI/ML infrastructure.

Benefits and Compensation

  • Base salary range (depends on location and level):
    • Level 2: 120,000 USD - 189,750 USD
    • Level 3: 148,000 USD - 235,750 USD
  • Eligible for equity and comprehensive benefits. See NVIDIA benefits for details.
  • Full-time position based in Santa Clara, CA, United States. Applications accepted until October 17, 2025.

Additional Information

  • NVIDIA emphasizes diversity and is an equal opportunity employer. The base salary will be determined by location, experience, and internal pay equity.