Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Ansible @ 6
CentOS @ 6
Docker @ 4
Grafana @ 3
Kubernetes @ 4
Linux @ 6
Prometheus @ 3
Python @ 6
Leadership @ 4
Bash @ 6
Communication @ 4
Networking @ 4
Debugging @ 4
Technical Leadership @ 4
CUDA @ 4
GPU @ 4
Observability @ 4
AI @ 4
InfiniBand @ 3
NCCL @ 4
Slurm @ 4
HPC @ 4
Performance Analysis @ 4
Details
NVIDIA is seeking a highly skilled and experienced HPC Cluster Engineer to design, deploy, and operate GPU compute clusters for EDA (Electronic Design Automation) and high-performance computing workloads used across multiple teams and projects. You will collaborate with researchers and infrastructure teams to ensure clusters are performant, scalable, and reliable.
Responsibilities
- Develop and enhance the ecosystem around GPU-accelerated computing, including building scalable automation solutions.
- Continuously improve infrastructure provisioning, management, observability, and day-to-day operations through automation.
- Provide technical leadership and strategic guidance for managing large-scale HPC systems, including deployment of compute, networking, and storage.
- Foster strong customer and cross-functional partnerships to ensure consistent cluster support and adapt to evolving user needs.
- Support researchers running EDA workloads, including performance analysis and optimizations.
- Conduct root cause analysis and suggest corrective actions; proactively identify and mitigate risks before they become issues.
- Build tooling to accelerate researcher velocity, debugging, and software performance at scale.
Requirements
- Bachelor’s degree in Computer Science, Electrical Engineering, or related field, or equivalent experience.
- Minimum of 5 years of proven experience crafting and operating large-scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.
- Experience with AI/HPC job schedulers and orchestrators such as Slurm, LSF, PBS, or Kubernetes. Applied experience with AI/HPC workflows that use MPI and NCCL.
- Proficient with Linux (Rocky/CentOS/RHEL and/or Ubuntu distributions).
- Solid understanding of container technologies such as Enroot and Docker.
- Proficiency in Python and Bash.
- Experience analyzing and tuning performance for a variety of EDA workloads; excellent problem-solving skills for analyzing complex systems, identifying bottlenecks, and implementing scalable solutions.
- Excellent communication and collaboration skills.
- Passion for continual learning and staying current with HPC infrastructure technologies and approaches.
Ways to stand out
- Background with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking.
- Experience supporting EDA workloads and tools.
- Familiarity with high-speed networking for HPC, including InfiniBand, RDMA, and RoCE.
- Understanding of fast, distributed storage systems such as Lustre and GPFS for AI/HPC workloads.
- Familiarity with metrics collection and visualization at scale using Prometheus, OpenSearch, and Grafana.
Compensation & Benefits
- Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
- Base salary ranges provided in the posting:
  - Level 3: 152,000 USD - 241,500 USD
  - Level 4: 184,000 USD - 287,500 USD
- You will also be eligible for equity and benefits (see NVIDIA benefits page referenced in the original posting).
Additional information
- Applications for this job will be accepted at least until March 15, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.