Senior AI-HPC EDA Cluster Engineer

at NVIDIA
USD 148,000-287,500 per year
SENIOR
✅ On-site

Required Skills & Competences

Ansible @ 6, CentOS @ 6, Docker @ 4, Grafana @ 3, Kubernetes @ 4, Linux @ 6, Prometheus @ 3, Python @ 6, Leadership @ 4, Bash @ 6, Communication @ 4, Networking @ 4, Rust @ 6, Debugging @ 4, CUDA @ 4, GPU @ 4

Details

NVIDIA is seeking a highly skilled and experienced AI-HPC Cluster Engineer to design, deploy, and operate GPU compute clusters for EDA and high-performance computing workloads used across multiple teams and projects. You will collaborate with researchers and infrastructure teams to ensure GPU clusters perform efficiently, scale well, and remain reliable.

Responsibilities

  • Provide leadership and strategic mentorship on the management of large-scale HPC systems including deployment of compute, networking, and storage.
  • Develop and improve the ecosystem around GPU-accelerated computing, including developing scalable automation solutions.
  • Continuously improve infrastructure provisioning, management, observability and day-to-day operation through automation.
  • Build and nurture customer and cross-team relationships to consistently support clusters and address changing user needs.
  • Support researchers to run their workloads, including performance analysis and optimizations.
  • Conduct root-cause analysis and suggest corrective action; proactively find and fix issues before they occur.
  • Build innovative tooling to accelerate researchers' velocity, debugging, and software performance at scale.

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
  • Minimum of 5 years of proven experience crafting and operating large-scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.
  • Experience with AI/HPC job schedulers and orchestrators such as Slurm, LSF, PBS or Kubernetes. Applied experience with AI/HPC workflows that use MPI and NCCL.
  • Proficient in Linux (Rocky/CentOS/RHEL and/or Ubuntu).
  • Solid understanding of container technologies such as Enroot and Docker.
  • Proficiency in one scripting language (Python, Bash) and at least one compiled language (Golang, Rust, C, C++).
  • Experience analyzing and tuning performance for a variety of AI/HPC workloads; strong problem-solving skills for analyzing complex systems, identifying bottlenecks, and implementing scalable solutions.
  • Excellent communication and collaboration skills.
  • Passion for continual learning and staying ahead of new technologies and approaches in HPC infrastructure.

Ways to stand out

  • Background with NVIDIA GPUs, CUDA programming, NCCL and MLPerf benchmarking.
  • Experience supporting EDA workloads and tools.
  • Familiarity with high-speed networking for HPC including InfiniBand, RDMA and RoCE.
  • Understanding of fast, distributed storage systems such as Lustre and GPFS for AI/HPC.
  • Familiarity with metrics collection and visualization at scale with Prometheus, OpenSearch and Grafana.

Compensation & Benefits

  • Base salary ranges: $148,000 - $235,750 USD for Level 3, and $184,000 - $287,500 USD for Level 4. Your base salary will be determined based on your location, experience, and pay of employees in similar positions.
  • Eligibility for equity and benefits (see NVIDIA benefits page).

Application & Other Info

  • Applications for this job will be accepted at least until September 20, 2025.
  • NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.