Required Skills & Competences
- Ansible @ 3
- Docker @ 3
- Kubernetes @ 3
- Python @ 6
- TensorFlow @ 4
- Communication @ 6
- Networking @ 4
- Parallel Programming @ 6
- Debugging @ 4
- System Architecture @ 7
- PyTorch @ 4
- CUDA @ 3
- GPU @ 4
Details
NVIDIA's GPU Communications Libraries and Networking team delivers libraries such as NCCL, NVSHMEM and UCX for Deep Learning and HPC. We are seeking a motivated performance engineer to influence the roadmap of our communication libraries and improve communication performance across multi-GPU and multi-node clusters. This role focuses on performance characterization, analysis, tooling, triage and collaboration across hardware and software stacks.
Responsibilities
- Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
- Study interactions of libraries with hardware (GPU, CPU, networking) and software components across the stack.
- Evaluate proof-of-concepts and perform trade-off analysis for alternative solutions.
- Triage and root-cause performance issues reported by customers.
- Collect large volumes of performance data; build tools and infrastructure to visualize and analyze it (a minimal analysis sketch follows this list).
- Collaborate with a dynamic, cross-time-zone team.
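As an illustration of the data-collection and analysis work described above, the sketch below aggregates per-rank latency samples into summary statistics. It is a minimal sketch only: the file pattern, CSV layout and column names (rank, msg_bytes, latency_us) are assumptions made for this example, not part of any NVIDIA tooling.

```python
"""Aggregate per-rank latency samples into summary statistics.

Assumes each input CSV has columns: rank, msg_bytes, latency_us
(a hypothetical layout chosen for this sketch).
"""
import csv
import glob
import statistics
from collections import defaultdict


def load_samples(pattern="results/*.csv"):
    """Read latency samples from all matching CSVs, grouped by message size."""
    samples = defaultdict(list)
    for path in glob.glob(pattern):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                samples[int(row["msg_bytes"])].append(float(row["latency_us"]))
    return samples


def summarize(samples):
    """Print mean / p50 / p99 latency per message size."""
    print(f"{'bytes':>10} {'mean_us':>10} {'p50_us':>10} {'p99_us':>10}")
    for size in sorted(samples):
        vals = sorted(samples[size])
        p99 = vals[min(len(vals) - 1, int(len(vals) * 0.99))]
        print(f"{size:>10} {statistics.mean(vals):>10.2f} "
              f"{statistics.median(vals):>10.2f} {p99:>10.2f}")


if __name__ == "__main__":
    summarize(load_samples())
```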
Requirements
- M.S. (or equivalent experience) or Ph.D. in Computer Science or a related field with relevant performance engineering and HPC experience.
- 3+ years of experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM).
- Experience conducting performance benchmarking and triage on large-scale HPC clusters.
- Strong understanding of computer system architecture, hardware-software interactions and operating systems principles.
- Ability to implement micro-benchmarks in C/C++ and to read and modify existing code bases (a minimal benchmarking sketch follows this list).
- Ability to debug performance issues across the entire HW/SW stack.
- Proficient in a scripting language, preferably Python.
- Familiarity with containers, cloud provisioning and scheduling tools (Kubernetes, SLURM, Ansible, Docker).
- Adaptability and willingness to learn new tools and areas; ability to work and communicate effectively across teams and time zones.
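As an illustration of the micro-benchmarking pattern referenced above (warm-up, timed iterations, derived latency), here is a minimal two-rank ping-pong sketch using mpi4py. It is a sketch of the measurement pattern only; the message size and iteration counts are arbitrary choices, and the benchmarks named in this posting are implemented in C/C++ against runtimes such as MPI, NCCL, UCX and NVSHMEM.

```python
"""Minimal two-rank ping-pong latency micro-benchmark (mpi4py sketch).

Run with, for example: mpirun -np 2 python pingpong.py
Message size and iteration counts are arbitrary illustrative choices.
"""
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

MSG_BYTES = 1 << 20      # 1 MiB payload
WARMUP, ITERS = 10, 100
buf = np.zeros(MSG_BYTES, dtype=np.uint8)

comm.Barrier()
start = MPI.Wtime()
for i in range(WARMUP + ITERS):
    if i == WARMUP:            # exclude warm-up iterations from the timing
        comm.Barrier()
        start = MPI.Wtime()
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=0)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    # One iteration is a full round trip; report half of it as one-way latency.
    print(f"one-way latency: {elapsed / ITERS / 2 * 1e6:.2f} us "
          f"for {MSG_BYTES} bytes")
```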
Ways to stand out
- Practical experience with InfiniBand/Ethernet networks (RDMA, topologies, congestion control).
- Experience debugging network issues in large-scale deployments.
- Familiarity with CUDA programming and/or GPUs.
- Experience with deep learning frameworks such as PyTorch or TensorFlow.
Compensation and benefits
- Base salary ranges by location and level:
  - Level 3: 148,000 USD - 235,750 USD
  - Level 4: 184,000 USD - 287,500 USD
- Eligible for equity and benefits (link to NVIDIA benefits).
Other information
- Applications accepted at least until August 12, 2025.
- NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.