Senior HPC Cluster Engineer

at NVIDIA
USD 152,000-287,500 per year
SENIOR
✅ On-site

Required Skills & Competences

Ansible @ 6, CentOS @ 6, Docker @ 4, Grafana @ 3, Kubernetes @ 4, Linux @ 6, Prometheus @ 3, Python @ 6, Leadership @ 4, Bash @ 6, Communication @ 4, Networking @ 4, Debugging @ 4, Technical Leadership @ 4, CUDA @ 4, GPU @ 4, Observability @ 4, AI @ 4, InfiniBand @ 3, NCCL @ 4, Slurm @ 4, HPC @ 4, Performance Analysis @ 4

Details

NVIDIA is seeking a highly skilled and experienced HPC Cluster Engineer to design, deploy, and operate GPU compute clusters for EDA (Electronic Design Automation) and high-performance computing workloads used across multiple teams and projects. You will collaborate with researchers and infrastructure teams to ensure clusters are performant, scalable, and reliable.

Responsibilities

  • Develop and enhance the ecosystem around GPU-accelerated computing, including building scalable automation solutions.
  • Continuously improve infrastructure provisioning, management, observability, and day-to-day operations through automation.
  • Provide technical leadership and strategic guidance for managing large-scale HPC systems, including deployment of compute, networking, and storage.
  • Foster strong customer and cross-functional partnerships to ensure consistent cluster support and adapt to evolving user needs.
  • Support researchers running EDA workloads, including performance analysis and optimizations.
  • Conduct root cause analysis and recommend corrective actions; proactively identify and address potential issues before they affect users.
  • Build tooling to accelerate researcher velocity, debugging, and software performance at scale.

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering, or related field, or equivalent experience.
  • Minimum of 5 years of proven experience crafting and operating large-scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.
  • Experience with AI/HPC job schedulers and orchestrators such as Slurm, LSF, PBS, or Kubernetes. Applied experience with AI/HPC workflows that use MPI and NCCL.
  • Proficient with Linux (Rocky/CentOS/RHEL and/or Ubuntu distributions).
  • Solid understanding of container technologies such as Enroot and Docker.
  • Proficiency in Python and Bash.
  • Experience analyzing and tuning performance for a variety of EDA workloads; excellent problem-solving skills for analyzing complex systems, identifying bottlenecks, and implementing scalable solutions.
  • Excellent communication and collaboration skills.
  • Passion for continual learning and staying current with HPC infrastructure technologies and approaches.

Ways to stand out

  • Background with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking.
  • Experience supporting EDA workloads and tools.
  • Familiarity with high-speed networking for HPC, including InfiniBand, RDMA, and RoCE.
  • Understanding of fast, distributed storage systems such as Lustre and GPFS for AI/HPC workloads.
  • Familiarity with metrics collection and visualization at scale using Prometheus, OpenSearch, and Grafana.

Compensation & Benefits

  • Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
  • Base salary ranges provided in the posting:
    • Level 3: USD 152,000-241,500
    • Level 4: USD 184,000-287,500
  • You will also be eligible for equity and benefits (see the NVIDIA benefits page referenced in the original posting).

Additional information

  • Applications for this job will be accepted at least until March 15, 2026.
  • This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.