GPU and HPC Infrastructure Engineer - New College Grad 2025
at NVIDIA
Santa Clara, United States
USD 104,000-189,800 per year
Required Skills & Competences
Security @ 3, System Administration @ 3, Go @ 6, Kubernetes @ 3, Linux @ 3, Python @ 6, Algorithms @ 6, Data Structures @ 6, Distributed Systems @ 3, Machine Learning @ 3, MLOps @ 6, Hiring @ 3, Mathematics @ 3, Networking @ 3, GPU @ 3
Details
NVIDIA is hiring engineers to scale up its AI infrastructure. You will work on software and automation that provisions, configures, monitors, and manages GPU assets and large-scale machine learning clusters. The role requires strong programming, systems thinking, and collaboration across hardware and software teams to deliver reliable, scalable datacenter solutions.
Responsibilities
- Contribute to and extend a platform that automates GPU asset provisioning, configuration, and lifecycle management across cloud providers.
- Build end-to-end automation for datacenter operations, break/fix, and lifecycle management for large-scale ML systems.
- Implement monitoring and health management capabilities using multiple data streams (GPU hardware diagnostics, cluster and network telemetry) to ensure high reliability, availability, and scalability of GPU assets.
- Work on software that manages NVLink topology across GPU clusters.
- Build automated test infrastructure used to qualify distributed systems for operation.
- Collaborate with engineering teams across NVIDIA to integrate software from hardware up to AI training applications.
- Continuously innovate, discover new problems and deliver solutions with a strong execution bias.
Requirements
- Pursuing or recently completed a BS or MS in Computer Science, Computer Engineering, Physics, Mathematics, or comparable degree, or equivalent experience.
- Software engineering experience on large-scale production systems.
- Experience working successfully with cross-functional teams, principals, and architects; ability to coordinate across organizational boundaries and geographies.
- Strong knowledge of a systems programming language (for example, Go or Python) and a solid understanding of data structures and algorithms.
- High-level knowledge of Linux system administration and management.
- Understanding of cluster management systems (Kubernetes, SLURM).
- Understanding of performance, security, and reliability in complex distributed systems, including system-level architecture, data synchronization, fault tolerance, and state management.
Ways to stand out
- Experience with High Performance Computing (HPC) and GPUs; deep knowledge of datacenter operations and GPU hardware.
- Hands-on experience with high-performance networking (RDMA, InfiniBand, RoCE).
- Advanced hands-on experience with cluster management systems (Kubernetes, SLURM) and machine learning operations (MLOps).
- Hands-on experience with Bright Cluster Manager and hardware fleet management systems.
- Proven operational excellence in designing and maintaining AI infrastructure.
Compensation & Benefits
- Base salary ranges (by level):
  - Level 1: USD 104,000-172,500
  - Level 2: USD 120,000-189,750
- You will also be eligible for equity and benefits (see NVIDIA benefits).
Additional details
- Location: Santa Clara, CA, United States.
- Employment type: Full time.
- Application window: Applications accepted at least until October 5, 2025.
- Equal opportunity: NVIDIA is an equal opportunity employer and values diversity in hiring and promotion practices.