GPU and HPC Infrastructure Engineer - New College Grad 2025

at NVIDIA

📍 Santa Clara, United States

USD 104,000-189,800 per year

JUNIOR

✅ On-site

Required Skills & Competences

Security @ 3, System Administration @ 3, Go @ 6, Kubernetes @ 3, Linux @ 3, Python @ 6, Algorithms @ 6, Data Structures @ 6, Distributed Systems @ 3, Machine Learning @ 3, MLOps @ 6, Hiring @ 3, Mathematics @ 3, Networking @ 3, GPU @ 3

Details

NVIDIA is hiring engineers to scale up its AI infrastructure. You will work on software and automation that provisions, configures, monitors, and manages GPU assets and large-scale machine learning clusters. The role requires strong programming, systems thinking, and collaboration across hardware and software teams to deliver reliable, scalable datacenter solutions.

Responsibilities

Contribute to and extend a platform that automates GPU asset provisioning, configuration, and lifecycle management across cloud providers.
Build end-to-end automation for datacenter operations, break/fix, and lifecycle management for large-scale ML systems.
Implement monitoring and health management capabilities using multiple data streams (GPU hardware diagnostics, cluster and network telemetry) to ensure high reliability, availability, and scalability of GPU assets.
Work on software that manages NVLink topology across GPU clusters.
Build automated test infrastructure used to qualify distributed systems for operation.
Collaborate with engineering teams across NVIDIA to integrate software from hardware up to AI training applications.
Continuously innovate, discover new problems, and deliver solutions with a strong bias for execution.

Requirements

Pursuing or recently completed a BS or MS in Computer Science, Computer Engineering, Physics, Mathematics, or a comparable field, or equivalent experience.
Software engineering experience on large-scale production systems.
Experience working successfully with cross-functional teams, principals, and architects; ability to coordinate across organizational boundaries and geographies.
Strong knowledge of a programming language used for systems work (such as Go or Python) and a solid understanding of data structures and algorithms.
High-level knowledge of Linux system administration and management.
Understanding of cluster management systems (Kubernetes, SLURM).
Understanding of performance, security, and reliability in complex distributed systems, including system-level architecture, data synchronization, fault tolerance, and state management.

Ways to stand out

Experience with High Performance Computing (HPC) and GPUs; deep knowledge of datacenter operations and GPU hardware.
Hands-on experience with high-performance networking (RDMA, InfiniBand, RoCE).
Advanced hands-on experience with cluster management systems (Kubernetes, SLURM) and machine learning operations (MLOps).
Hands-on experience with Bright Cluster Manager and hardware fleet management systems.
Proven operational excellence in designing and maintaining AI infrastructure.

Compensation & Benefits

Base salary ranges (by level):
- Level 1: USD 104,000 - 172,500
- Level 2: USD 120,000 - 189,750
You will also be eligible for equity and benefits (see NVIDIA benefits).

Additional details

Location: Santa Clara, CA, United States.
Employment type: Full time.
Application window: Applications accepted at least until October 5, 2025.
Equal opportunity: NVIDIA is an equal opportunity employer and values diversity in hiring and promotion practices.