Senior HPC Cluster Engineer

at NVIDIA

📍 Santa Clara, United States

USD 152,000-287,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. Proficiency levels:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and know all the pitfalls and tricks.

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Ansible @ 6, CentOS @ 6, Docker @ 4, Grafana @ 3, Kubernetes @ 4, Linux @ 6, Prometheus @ 3, Python @ 6, Leadership @ 4, Bash @ 6, Communication @ 4, Networking @ 4, Debugging @ 4, Technical Leadership @ 4, CUDA @ 4, GPU @ 4, Observability @ 4, AI @ 4, InfiniBand @ 3, NCCL @ 4, Slurm @ 4, HPC @ 4, Performance Analysis @ 4

Details

NVIDIA is seeking a highly skilled and experienced HPC Cluster Engineer to design, deploy, and operate GPU compute clusters for EDA (Electronic Design Automation) and high-performance computing workloads used across multiple teams and projects. You will collaborate with researchers and infrastructure teams to ensure clusters are performant, scalable, and reliable.

Responsibilities

Develop and enhance the ecosystem around GPU-accelerated computing, including building scalable automation solutions.
Continuously improve infrastructure provisioning, management, observability, and day-to-day operations through automation.
Provide technical leadership and strategic guidance for managing large-scale HPC systems, including deployment of compute, networking, and storage.
Foster strong customer and cross-functional partnerships to ensure consistent cluster support and adapt to evolving user needs.
Support researchers running EDA workloads, including performance analysis and optimizations.
Conduct root cause analysis and suggest corrective actions; proactively identify and address potential issues before they affect users.
Build tooling to accelerate researcher velocity, debugging, and software performance at scale.

Requirements

Bachelor’s degree in Computer Science, Electrical Engineering, or related field, or equivalent experience.
Minimum of 5 years of proven experience crafting and operating large-scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.
Experience with AI/HPC job schedulers and orchestrators such as Slurm, LSF, PBS, or Kubernetes. Applied experience with AI/HPC workflows that use MPI and NCCL.
Proficient with Linux (Rocky/CentOS/RHEL and/or Ubuntu distributions).
Solid understanding of container technologies such as Enroot and Docker.
Proficiency in Python and Bash.
Experience analyzing and tuning performance for a variety of EDA workloads; excellent problem-solving skills for analyzing complex systems, identifying bottlenecks, and implementing scalable solutions.
Excellent communication and collaboration skills.
Passion for continual learning and staying current with HPC infrastructure technologies and approaches.

Ways to stand out

Background with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking.
Experience supporting EDA workloads and tools.
Familiarity with high-speed networking for HPC, including InfiniBand, RDMA, and RoCE.
Understanding of fast, distributed storage systems such as Lustre and GPFS for AI/HPC workloads.
Familiarity with metrics collection and visualization at scale using Prometheus, OpenSearch, and Grafana.

Compensation & Benefits

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
Base salary ranges provided in the posting:
- Level 3: 152,000 USD - 241,500 USD
- Level 4: 184,000 USD - 287,500 USD
You will also be eligible for equity and benefits (see the NVIDIA benefits page referenced in the original posting).

Additional information

Applications for this job will be accepted at least until March 15, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.