Used Tools & Technologies
Not specified
Required Skills & Competences
CentOS @ 6, Docker @ 4, Grafana @ 3, Kubernetes @ 4, Linux @ 6, Prometheus @ 3, Python @ 6, Algorithms @ 4, Machine Learning @ 4, TensorFlow @ 4, Hiring @ 4, Leadership @ 4, Bash @ 6, Communication @ 4, Networking @ 4, Rust @ 6, PyTorch @ 4, CUDA @ 4, GPU @ 4
Details
NVIDIA is hiring an experienced engineer to design and implement GPU compute clusters for deep learning and high-performance computing (HPC). The role focuses on operating and scaling large HPC/AI compute infrastructure, building automation and tooling, supporting researchers running AI/HPC workloads, and driving performance analysis and optimization.
Responsibilities
- Provide leadership and strategic mentorship on managing large-scale HPC systems, including deployment of compute, networking, and storage.
- Develop and improve the ecosystem around GPU-accelerated computing, including building scalable automation solutions.
- Build and nurture customer and cross-team relationships to support clusters and address changing user needs.
- Support researchers running workloads, including performance analysis and optimizations.
- Conduct root cause analysis and recommend corrective actions; proactively identify and resolve potential issues before they impact users.
- Build innovative tooling to accelerate researcher velocity, troubleshooting, and software performance at scale.
Requirements
- Bachelor’s degree in Computer Science, Electrical Engineering or related field, or equivalent experience.
- Minimum of 6 years of experience crafting and operating large-scale compute infrastructure.
- Experience with AI/HPC job schedulers and orchestrators such as Slurm, Kubernetes (K8s), or LSF. Applied experience with AI/HPC workflows that use MPI and NCCL (see the illustrative sketch after this list).
- Proficient in Linux (CentOS/RHEL and/or Ubuntu distributions).
- Solid understanding of container technologies (Enroot, Docker, Podman).
- Proficiency in one scripting language (Python or Bash) and at least one compiled language (Golang, Rust, C, C++).
- Experience analyzing and tuning performance for a variety of AI/HPC workloads; strong problem-solving skills to analyze complex systems, identify bottlenecks, and implement scalable solutions.
- Excellent communication and teamwork skills.
- Passion for continual learning and staying current with technologies and approaches in HPC and AI/ML infrastructure.
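To illustrate the MPI/NCCL workflow item above, here is a minimal sketch of the kind of multi-GPU sanity check such workflows rely on, assuming PyTorch with a CUDA/NCCL build and a launch via torchrun on a node with at least two GPUs; the file name and two-process setup are illustrative assumptions, not part of the role description:

```python
# nccl_allreduce_check.py -- illustrative sketch only, not part of the posting.
# Assumed launch: torchrun --nproc_per_node=2 nccl_allreduce_check.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # NCCL is the backend used for GPU-to-GPU collectives.
    dist.init_process_group(backend="nccl")

    # Each rank contributes a tensor equal to its rank; after all_reduce every
    # rank should hold the sum 0 + 1 + ... + (world_size - 1).
    world_size = dist.get_world_size()
    t = torch.full((1,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    expected = world_size * (world_size - 1) / 2
    assert t.item() == expected, f"all_reduce returned {t.item()}, expected {expected}"
    print(f"rank {dist.get_rank()}: NCCL all_reduce OK ({t.item()})")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```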
Preferred / Nice to Have
- Experience with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking.
- Experience with machine learning and deep learning concepts, algorithms, and models.
- Familiarity with high-speed networking for HPC (InfiniBand, RDMA, RoCE) and Amazon EFA.
- Understanding of distributed storage systems for AI/HPC (Lustre, GPFS).
- Experience with deep learning frameworks such as PyTorch, MegatronLM, and TensorFlow.
- Familiarity with metrics collection and visualization at scale (Prometheus, OpenSearch, Grafana); see the sketch after this list.
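To illustrate the metrics item above, here is a minimal sketch of a per-GPU utilization exporter that a Prometheus server could scrape and Grafana could chart, assuming the prometheus_client library and nvidia-smi are available on the host; the metric name and port are illustrative assumptions, not part of the role description:

```python
# gpu_util_exporter.py -- illustrative sketch only; the metric name, port, and
# use of prometheus_client and nvidia-smi here are assumptions, not part of the posting.
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# One gauge, labelled by GPU index, exposed on /metrics for Prometheus to scrape.
GPU_UTIL = Gauge("gpu_utilization_percent",
                 "GPU utilization reported by nvidia-smi", ["gpu"])


def sample() -> None:
    # nvidia-smi can emit CSV without headers or units, one line per GPU.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        GPU_UTIL.labels(gpu=index).set(float(util))


if __name__ == "__main__":
    start_http_server(9400)  # serve the /metrics endpoint
    while True:
        sample()
        time.sleep(15)
```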
Benefits
- Competitive base salary, with ranges by level: Level 4: 184,000 to 287,500 USD; Level 5: 224,000 to 356,500 USD.
- Eligible for equity and additional benefits (see NVIDIA benefits).
- NVIDIA is an equal opportunity employer committed to a diverse work environment.
Application deadline: Applications accepted at least until August 3, 2025.