GPU and HPC Infrastructure Engineer - New College Grad 2025

at NVIDIA
USD 104,000-189,750 per year
JUNIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

  • Security @ 3
  • System Administration @ 3
  • Go @ 6
  • Kubernetes @ 3
  • Linux @ 3
  • Python @ 6
  • Algorithms @ 6
  • Data Structures @ 6
  • Distributed Systems @ 3
  • Machine Learning @ 3
  • MLOps @ 6
  • Hiring @ 3
  • Mathematics @ 3
  • Networking @ 3
  • GPU @ 3

Details

NVIDIA is hiring engineers to scale up its AI infrastructure. You will work on software and automation that provisions, configures, monitors, and manages GPU assets and large-scale machine learning clusters. The role requires strong programming, systems thinking, and collaboration across hardware and software teams to deliver reliable, scalable datacenter solutions.

Responsibilities

  • Contribute to and extend a platform that automates GPU asset provisioning, configuration, and lifecycle management across cloud providers.
  • Build end-to-end automation for datacenter operations, break/fix, and lifecycle management for large-scale ML systems.
  • Implement monitoring and health management capabilities using multiple data streams (GPU hardware diagnostics, cluster and network telemetry) to ensure high reliability, availability, and scalability of GPU assets.
  • Work on software that manages NVLink topology across GPU clusters.
  • Build automated test infrastructure used to qualify distributed systems for operation.
  • Collaborate with engineering teams across NVIDIA to integrate software from hardware up to AI training applications.
  • Continuously innovate, discover new problems, and deliver solutions with a strong bias toward execution.

Requirements

  • Pursuing or recently completed a BS or MS in Computer Science, Computer Engineering, Physics, Mathematics, or comparable degree, or equivalent experience.
  • Software engineering experience on large-scale production systems.
  • Experience working successfully with cross-functional teams, principals, and architects; ability to coordinate across organizational boundaries and geographies.
  • Strong knowledge of a systems programming language (Go, Python) and a solid understanding of data structures and algorithms.
  • High-level knowledge of Linux system administration and management.
  • Understanding of cluster management systems (Kubernetes, SLURM).
  • Understanding of performance, security, and reliability in complex distributed systems, including system-level architecture, data synchronization, fault tolerance, and state management.

Ways to stand out

  • Experience with High Performance Computing (HPC) and GPUs; deep knowledge of datacenter operations and GPU hardware.
  • Hands-on experience with high-performance networking (RDMA, InfiniBand, RoCE).
  • Advanced hands-on experience with cluster management systems (Kubernetes, SLURM) and machine learning operations (MLOps).
  • Hands-on experience with Bright Cluster Manager and hardware fleet management systems.
  • Proven operational excellence in designing and maintaining AI infrastructure.

Compensation & Benefits

  • Base salary ranges (by level):
    • Level 1: USD 104,000 - 172,500
    • Level 2: USD 120,000 - 189,750
  • You will also be eligible for equity and benefits (see NVIDIA benefits).

Additional details

  • Location: Santa Clara, CA, United States.
  • Employment type: Full time.
  • Application window: Applications accepted at least until October 5, 2025.
  • Equal opportunity: NVIDIA is an equal opportunity employer and values diversity in hiring and promotion practices.