Senior AI-HPC Cluster Engineer

at Nvidia

📍 Santa Clara, United States

USD 184,000-356,500 per year

SENIOR

✅ On-site

SCRAPED

Used Tools & Technologies

Not specified

Required Skills & Competences ^?

Ansible @ 7 CentOS @ 6 Docker @ 4 Kubernetes @ 4 Linux @ 6 Python @ 6 Algorithms @ 4 Machine Learning @ 4 TensorFlow @ 4 Leadership @ 4 Bash @ 6 Communication @ 7 Networking @ 4 PyTorch @ 4 Puppet @ 7 Salt @ 7 CUDA @ 4 GPU @ 4

Details

NVIDIA is a leading technology company reinventing GPU computing and artificial intelligence. As part of the GPU AI/HPC Infrastructure team, you will lead design and implementation of advanced GPU compute clusters for deep learning, HPC, and computational workloads. This role requires strategic leadership to tackle architecture, networking, storage, and scalable automation challenges in a global compute environment.

Responsibilities

Lead management of large-scale HPC systems including compute, networking, and storage deployment.
Develop and enhance GPU-accelerated computing ecosystems and scalable automation.
Build and maintain AI and ML heterogeneous clusters on-premises and cloud.
Foster customer and cross-team relationships to sustain and evolve clusters.
Support researchers with workload performance analysis and optimization.
Perform root cause analysis and proactively address issues.

Requirements

Bachelor’s degree in Computer Science, Electrical Engineering, or equivalent experience.
Minimum 8 years designing and operating large scale compute infrastructure.
Experience with AI/HPC job schedulers such as Slurm, Kubernetes, RTDA, or LSF.
Proficient in CentOS/RHEL and/or Ubuntu Linux administration.
Strong understanding of configuration management tools (Ansible, Puppet, Salt).
Expertise in container technologies: Docker, Singularity, Podman, Shifter, Charliecloud.
Proficiency in Python programming and bash scripting.
Excellent problem-solving skills for complex system analysis and scalable solutions.
Experienced in AI/HPC workflows using MPI.
Strong communication and teamwork skills.
Capable of analyzing and tuning AI/HPC workloads performance.
Passionate about continual learning and emerging HPC/AI infrastructure tech.

Ways to Stand Out

Experience with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking.
Knowledge of machine learning and deep learning algorithms and models.
Familiarity with InfiniBand, IBOP, RDMA.
Understanding of distributed storage systems like Lustre and GPFS.
Knowledge of deep learning frameworks like PyTorch and TensorFlow.

Benefits

NVIDIA offers competitive salaries along with equity and comprehensive benefits. The company fosters diversity and is an equal opportunity employer aiming for an inclusive workforce.

Salary range: 184,000 USD - 356,500 USD per year, based on location, experience, and comparable roles.