Used Tools & Technologies
Not specified
Required Skills & Competences
Ansible @ 6, CentOS @ 6, Docker @ 4, Grafana @ 3, Kubernetes @ 4, Linux @ 6, Prometheus @ 3, Python @ 6, Leadership @ 4, Bash @ 6, Communication @ 4, Networking @ 4, Rust @ 6, Debugging @ 4, CUDA @ 4, GPU @ 4
Details
NVIDIA is seeking a highly skilled and experienced AI-HPC Cluster Engineer to design, deploy, and operate GPU compute clusters for EDA and high-performance computing workloads used across multiple teams and projects. You will collaborate with researchers and infrastructure teams to ensure GPU clusters perform efficiently, scale well, and remain reliable.
Responsibilities
- Provide leadership and strategic mentorship on the management of large-scale HPC systems including deployment of compute, networking, and storage.
- Develop and improve the ecosystem around GPU-accelerated computing, including developing scalable automation solutions.
- Continuously improve infrastructure provisioning, management, observability and day-to-day operation through automation.
- Build and nurture customer and cross-team relationships to consistently support clusters and address changing user needs.
- Support researchers to run their workloads, including performance analysis and optimizations.
- Conduct root-cause analysis and suggest corrective action; proactively find and fix potential issues before they impact users.
- Build innovative tooling to accelerate researchers' velocity, debugging, and software performance at scale.
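The automation and tooling responsibilities above are open-ended, so the following is only a minimal sketch of the kind of operational tooling involved: a Python check that parses per-GPU health data from nvidia-smi CSV output. The chosen query fields and the temperature threshold are illustrative assumptions, not details from the posting.

```python
import csv
import io
import subprocess

# Fields queried from nvidia-smi; the selection and the threshold below are
# illustrative assumptions for this sketch.
QUERY_FIELDS = "index,name,temperature.gpu,utilization.gpu,memory.used,memory.total"
TEMP_LIMIT_C = 85  # hypothetical alert threshold

def query_gpus():
    """Return a list of dicts, one per GPU, parsed from nvidia-smi CSV output."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    keys = QUERY_FIELDS.split(",")
    return [dict(zip(keys, (v.strip() for v in row))) for row in csv.reader(io.StringIO(out))]

def check_health():
    """Flag GPUs running hotter than the (assumed) temperature limit."""
    return [
        f"GPU {gpu['index']} ({gpu['name']}) at {gpu['temperature.gpu']} C"
        for gpu in query_gpus()
        if int(gpu["temperature.gpu"]) > TEMP_LIMIT_C
    ]

if __name__ == "__main__":
    for alert in check_health():
        print(alert)
```

In a real cluster this kind of check would typically feed a scheduler health-check hook or a metrics pipeline rather than printing to stdout.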
Requirements
- Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
- Minimum of 5 years of proven experience crafting and operating large-scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.
- Experience with AI/HPC job schedulers and orchestrators such as Slurm, LSF, PBS or Kubernetes. Applied experience with AI/HPC workflows that use MPI and NCCL.
- Proficient in Linux (Rocky/CentOS/RHEL and/or Ubuntu).
- Solid understanding of container technologies such as Enroot and Docker.
- Proficiency in one scripting language (Python, Bash) and at least one compiled language (Golang, Rust, C, C++); see the example sketch after this list.
- Experience analyzing and tuning performance for a variety of AI/HPC workloads; strong problem-solving skills to analyze complex systems, identify bottlenecks, and implement scalable solutions.
- Excellent communication and collaboration skills.
- Passion for continual learning and staying ahead of new technologies and approaches in HPC infrastructure.
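As a hypothetical example of the scripting proficiency and scheduler familiarity called for above (not a tool referenced in the posting), the sketch below uses Slurm's sinfo to group nodes by state and highlight those that likely need operator attention. The sinfo format string is standard; the set of "problem" states and the reporting logic are assumptions.

```python
import subprocess
from collections import defaultdict

# States that usually indicate a node needs operator attention.
# The exact set chosen here is an illustrative assumption.
PROBLEM_STATES = {"down", "drain", "drng", "fail", "maint"}

def nodes_by_state():
    """Group node names by state using sinfo's node-oriented output."""
    out = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %t"],  # node name and short state, no header
        capture_output=True, text=True, check=True,
    ).stdout
    grouped = defaultdict(list)
    for line in out.splitlines():
        node, state = line.split()
        grouped[state.rstrip("*~#")].append(node)  # drop Slurm state suffix flags
    return grouped

if __name__ == "__main__":
    for state, nodes in sorted(nodes_by_state().items()):
        marker = "!!" if state in PROBLEM_STATES else "  "
        print(f"{marker} {state:8s} {len(nodes):4d} nodes")
```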
Ways to stand out
- Background with NVIDIA GPUs, CUDA programming, NCCL and MLPerf benchmarking.
- Experience supporting EDA workloads and tools.
- Familiarity with high-speed networking for HPC including InfiniBand, RDMA and RoCE.
- Understanding of fast, distributed storage systems such as Lustre and GPFS for AI/HPC.
- Familiarity with metrics collection and visualization at scale with Prometheus, OpenSearch and Grafana.
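For the Prometheus and Grafana point above, a minimal sketch of a custom metrics exporter is shown below, using the prometheus_client Python library. The metric names, the port, and the reuse of the nvidia-smi query pattern are assumptions for illustration; production clusters commonly use a dedicated exporter such as DCGM instead.

```python
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# Gauge names here are illustrative, not an established naming convention.
GPU_TEMP = Gauge("gpu_temperature_celsius", "GPU temperature", ["gpu"])
GPU_UTIL = Gauge("gpu_utilization_percent", "GPU utilization", ["gpu"])

def scrape_once():
    """Refresh gauge values from nvidia-smi CSV output."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        idx, temp, util = (v.strip() for v in line.split(","))
        GPU_TEMP.labels(gpu=idx).set(float(temp))
        GPU_UTIL.labels(gpu=idx).set(float(util))

if __name__ == "__main__":
    start_http_server(9101)  # port choice is arbitrary for this sketch
    while True:
        scrape_once()
        time.sleep(15)
```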
Compensation & Benefits
- Base salary ranges: $148,000 - $235,750 USD for Level 3, and $184,000 - $287,500 USD for Level 4. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
- Eligibility for equity and benefits (see NVIDIA benefits page).
Application & Other Info
- Applications for this job will be accepted at least until September 20, 2025.
- NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.