Used Tools & Technologies
Not specified
Required Skills & Competences
CentOS @ 6, Docker @ 4, Grafana @ 3, Kubernetes @ 4, Linux @ 6, Prometheus @ 3, Python @ 6, Algorithms @ 4, Machine Learning @ 4, TensorFlow @ 4, Hiring @ 4, Leadership @ 4, Bash @ 6, Communication @ 4, Networking @ 4, Rust @ 6, PyTorch @ 4, CUDA @ 4, GPU @ 4
Details
NVIDIA is hiring an experienced engineer to design and implement GPU compute clusters for deep learning and high-performance computing (HPC). The role focuses on operating and scaling large HPC/AI compute infrastructure, building automation and tooling, supporting researchers running AI/HPC workloads, and driving performance analysis and optimization.
Responsibilities
- Provide leadership and strategic mentorship on managing large-scale HPC systems, including deployment of compute, networking, and storage.
- Develop and improve the ecosystem around GPU-accelerated computing, including building scalable automation solutions.
- Build and nurture customer and cross-team relationships to support clusters and address changing user needs.
- Support researchers running workloads, including performance analysis and optimizations.
- Conduct root cause analysis and recommend corrective actions; proactively identify and resolve potential issues before they impact users.
- Build innovative tooling to accelerate researcher velocity, troubleshooting, and software performance at scale.
Requirements
- Bachelor’s degree in Computer Science, Electrical Engineering or related field, or equivalent experience.
- Minimum of 6 years of experience crafting and operating large-scale compute infrastructure.
- Experience with AI/HPC job schedulers and orchestrators such as Slurm, Kubernetes (K8s), or LSF. Applied experience with AI/HPC workflows that use MPI and NCCL (see the illustrative sketch after this list).
- Proficient in Linux (CentOS/RHEL and/or Ubuntu distributions).
- Solid understanding of container technologies (Enroot, Docker, Podman).
- Proficiency in one scripting language (Python or Bash) and at least one compiled language (Golang, Rust, C, C++).
- Experience analyzing and tuning performance for a variety of AI/HPC workloads; strong problem-solving skills to analyze complex systems, identify bottlenecks, and implement scalable solutions.
- Excellent communication and teamwork skills.
- Passion for continual learning and staying current with technologies and approaches in HPC and AI/ML infrastructure.
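To illustrate the MPI/NCCL workflow item above, here is a minimal sketch of the kind of multi-GPU sanity check such workflows rely on, assuming PyTorch with a CUDA/NCCL build and a launch via torchrun on a node with at least two GPUs; the file name and two-process setup are illustrative assumptions, not part of the role description:

```python
# nccl_allreduce_check.py -- illustrative sketch only, not part of the posting.
# Assumed launch: torchrun --nproc_per_node=2 nccl_allreduce_check.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # NCCL is the backend used for GPU-to-GPU collectives.
    dist.init_process_group(backend="nccl")

    # Each rank contributes a tensor equal to its rank; after all_reduce every
    # rank should hold the sum 0 + 1 + ... + (world_size - 1).
    world_size = dist.get_world_size()
    t = torch.full((1,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    expected = world_size * (world_size - 1) / 2
    assert t.item() == expected, f"all_reduce returned {t.item()}, expected {expected}"
    print(f"rank {dist.get_rank()}: NCCL all_reduce OK ({t.item()})")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```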
Preferred / Nice to Have
- Experience with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking.
- Experience with machine learning and deep learning concepts, algorithms, and models.
- Familiarity with high-speed networking for HPC (InfiniBand, RDMA, RoCE) and Amazon EFA.
- Understanding of distributed storage systems for AI/HPC (Lustre, GPFS).
- Experience with deep learning frameworks such as PyTorch, MegatronLM, and TensorFlow.
- Familiarity with metrics collection and visualization at scale (Prometheus, OpenSearch, Grafana); see the sketch after this list.
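To illustrate the metrics item above, here is a minimal sketch of a per-GPU utilization exporter that a Prometheus server could scrape and Grafana could chart, assuming the prometheus_client library and nvidia-smi are available on the host; the metric name and port are illustrative assumptions, not part of the role description:

```python
# gpu_util_exporter.py -- illustrative sketch only; the metric name, port, and
# use of prometheus_client and nvidia-smi here are assumptions, not part of the posting.
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# One gauge, labelled by GPU index, exposed on /metrics for Prometheus to scrape.
GPU_UTIL = Gauge("gpu_utilization_percent",
                 "GPU utilization reported by nvidia-smi", ["gpu"])


def sample() -> None:
    # nvidia-smi can emit CSV without headers or units, one line per GPU.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        GPU_UTIL.labels(gpu=index).set(float(util))


if __name__ == "__main__":
    start_http_server(9400)  # serve the /metrics endpoint
    while True:
        sample()
        time.sleep(15)
```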
Benefits
- Competitive base salary, with ranges by level: Level 4: 184,000 to 287,500 USD; Level 5: 224,000 to 356,500 USD.
- Eligible for equity and additional benefits (see NVIDIA benefits).
- NVIDIA is an equal opportunity employer committed to a diverse work environment.
Application deadline: Applications accepted at least until August 3, 2025.