Senior AI-HPC Cluster Engineer - MLOps

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

CentOS @ 6, Docker @ 4, Grafana @ 3, Kubernetes @ 4, Linux @ 6, Prometheus @ 3, Python @ 6, Algorithms @ 4, Machine Learning @ 4, TensorFlow @ 4, Hiring @ 4, Leadership @ 4, Bash @ 6, Communication @ 4, Networking @ 4, Rust @ 6, PyTorch @ 4, CUDA @ 4, GPU @ 4

Details

NVIDIA is hiring an experienced engineer to design and implement GPU compute clusters for deep learning and high-performance computing (HPC). The role focuses on operating and scaling large HPC/AI compute infrastructure, building automation and tooling, supporting researchers running AI/HPC workloads, and driving performance analysis and optimization.

Responsibilities

  • Provide leadership and strategic mentorship on managing large-scale HPC systems, including deployment of compute, networking, and storage.
  • Develop and improve the ecosystem around GPU-accelerated computing, including building scalable automation solutions.
  • Build and nurture customer and cross-team relationships to support clusters and address changing user needs.
  • Support researchers running workloads, including performance analysis and optimizations.
  • Conduct root cause analysis and suggest corrective actions; proactively identify and resolve potential issues before they affect users.
  • Build innovative tooling to accelerate researcher velocity, troubleshooting, and software performance at scale.

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
  • Minimum of 6 years of experience crafting and operating large-scale compute infrastructure.
  • Experience with AI/HPC job schedulers and orchestrators such as Slurm, Kubernetes (K8s), or LSF. Applied experience with AI/HPC workflows that use MPI and NCCL; a minimal sketch of such a workflow follows this list.
  • Proficient in Linux (CentOS/RHEL and/or Ubuntu distributions).
  • Solid understanding of container technologies (Enroot, Docker, Podman).
  • Proficiency in one scripting language (Python or Bash) and at least one compiled language (Golang, Rust, C, C++).
  • Experience analyzing and tuning performance for a variety of AI/HPC workloads; strong problem-solving skills to analyze complex systems, identify bottlenecks, and implement scalable solutions.
  • Excellent communication and teamwork skills.
  • Passion for continual learning and staying current with technologies and approaches in HPC and AI/ML infrastructure.
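
As an illustration of the MPI/NCCL-style workflows referenced above, here is a minimal sketch of an NCCL all-reduce sanity check built on PyTorch's torch.distributed API. The script name and tensor size are arbitrary, and it assumes the launcher (torchrun, or srun under Slurm) provides the RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT environment variables.

    # nccl_allreduce_check.py (hypothetical name): minimal multi-GPU NCCL sanity check.
    # Assumes the launcher (torchrun, or srun with the env:// variables exported)
    # sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
    import os

    import torch
    import torch.distributed as dist


    def main() -> None:
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # The NCCL backend handles the GPU-to-GPU collective.
        dist.init_process_group(backend="nccl")

        # Every rank contributes a tensor of ones; after the all-reduce each
        # element should equal the world size on every rank.
        x = torch.ones(1024, device="cuda")
        dist.all_reduce(x, op=dist.ReduceOp.SUM)
        expected = torch.full_like(x, float(dist.get_world_size()))
        assert torch.allclose(x, expected), "all_reduce result mismatch"

        if dist.get_rank() == 0:
            print(f"NCCL all_reduce OK across {dist.get_world_size()} ranks")
        dist.destroy_process_group()


    if __name__ == "__main__":
        main()

On a single node this could be launched as, for example, torchrun --nproc_per_node=8 nccl_allreduce_check.py; under Slurm, an sbatch/srun launch that exports the same environment variables would play the equivalent role.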

Preferred / Nice to Have

  • Experience with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking.
  • Experience with machine learning and deep learning concepts, algorithms, and models.
  • Familiarity with high-speed networking for HPC (InfiniBand, RDMA, RoCE) and Amazon EFA.
  • Understanding of distributed storage systems for AI/HPC (Lustre, GPFS).
  • Experience with deep learning frameworks such as PyTorch, Megatron-LM, and TensorFlow.
  • Familiarity with metrics collection and visualization at scale (Prometheus, OpenSearch, Grafana); a small query sketch follows this list.
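
As a small illustration of the metrics stack noted above, the following hypothetical Python sketch queries a Prometheus server's HTTP API for per-GPU utilization. The server URL, the DCGM_FI_DEV_GPU_UTIL metric, and the Hostname/gpu labels are assumptions based on a typical dcgm-exporter setup and would need to be adjusted for a given deployment.

    # gpu_util_report.py (hypothetical name): print per-GPU utilization from Prometheus.
    # PROM_URL, the DCGM_FI_DEV_GPU_UTIL metric, and the Hostname/gpu labels are
    # assumptions based on a typical dcgm-exporter setup; adjust for your deployment.
    import requests

    PROM_URL = "http://prometheus.example.internal:9090"
    QUERY = "avg by (Hostname, gpu) (DCGM_FI_DEV_GPU_UTIL)"


    def main() -> None:
        resp = requests.get(
            f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10
        )
        resp.raise_for_status()
        payload = resp.json()
        if payload.get("status") != "success":
            raise RuntimeError(f"Prometheus query failed: {payload}")

        # Each instant-vector sample carries its label set and a [timestamp, value] pair.
        for sample in payload["data"]["result"]:
            labels = sample["metric"]
            host = labels.get("Hostname", "unknown")
            gpu = labels.get("gpu", "?")
            util = float(sample["value"][1])
            print(f"{host} GPU {gpu}: {util:.0f}% utilization")


    if __name__ == "__main__":
        main()

The same PromQL expression could back a Grafana panel; the script simply exercises Prometheus's /api/v1/query endpoint directly.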

Benefits

  • Competitive base salary. Ranges by level: Level 4: 184,000-287,500 USD; Level 5: 224,000-356,500 USD.
  • Eligible for equity and additional benefits (see NVIDIA benefits).
  • NVIDIA is an equal opportunity employer committed to a diverse work environment.

Application deadline: Applications accepted at least until August 3, 2025.