Used Tools & Technologies
Not specified
Required Skills & Competences
Ansible @ 3, Docker @ 3, Kubernetes @ 3, Python @ 6, TensorFlow @ 4, Communication @ 4, Networking @ 4, Parallel Programming @ 6, Debugging @ 4, System Architecture @ 4, PyTorch @ 4, CUDA @ 3, GPU @ 4
Details
NVIDIA is leading developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, NVIDIA's invention, serves as the visual cortex of modern computers and is at the heart of the company's products and services. The GPU Communications Libraries and Networking team delivers libraries such as NCCL, NVSHMEM and UCX for Deep Learning and HPC. This role focuses on performance engineering for communication libraries running at scale on large multi-GPU, multi-node clusters.
Responsibilities
- Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
- Study interactions of libraries with hardware (GPU, CPU, networking) and software components in the stack.
- Evaluate proofs-of-concept and perform trade-off analysis among multiple solutions.
- Triage and root-cause performance issues reported by customers.
- Collect performance data and build tools/infrastructure to visualize and analyze it.
- Collaborate with a dynamic team across multiple time zones.
Requirements
- M.S. (or equivalent experience) or PhD in Computer Science or a related field with relevant performance engineering and HPC experience.
- 3+ years of experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM).
- Experience conducting performance benchmarking and triage on large-scale HPC clusters.
- Good understanding of computer system architecture, HW–SW interactions and operating system principles (systems software fundamentals).
- Ability to implement micro-benchmarks in C/C++ and read/modify existing codebases when required (a minimal example of such a micro-benchmark is sketched after this list).
- Ability to debug performance issues across the entire hardware/software stack.
- Proficiency in a scripting language, preferably Python.
- Familiarity with containers, cloud provisioning and scheduling tools (Kubernetes, SLURM, Ansible, Docker).
- Adaptability and willingness to learn new areas and tools; ability to work across different teams and time zones.
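As an illustration of the micro-benchmarking skill listed above, the sketch below is a minimal MPI ping-pong test in C that estimates one-way latency and effective bandwidth across message sizes. It is illustrative only, not part of the posting or NVIDIA code; the choice of MPI (rather than NCCL, UCX or NVSHMEM), the message sizes and the iteration counts are assumptions made for the example.

```c
/*
 * Minimal MPI ping-pong micro-benchmark (illustrative sketch).
 * Build: mpicc -O2 pingpong.c -o pingpong
 * Run:   mpirun -np 2 ./pingpong
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WARMUP 100
#define ITERS  1000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "Run with at least 2 ranks.\n");
        MPI_Finalize();
        return 1;
    }

    /* Sweep message sizes from 8 B to 1 MiB. */
    for (size_t bytes = 8; bytes <= (1 << 20); bytes *= 2) {
        char *buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);

        double start = 0.0;
        for (int i = 0; i < WARMUP + ITERS; i++) {
            if (i == WARMUP) start = MPI_Wtime();  /* start timing after warm-up */
            if (rank == 0) {
                MPI_Send(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;

        if (rank == 0) {
            double one_way_us = elapsed / ITERS / 2.0 * 1e6;          /* half the round trip */
            double bw_mb_s    = (double)bytes / (one_way_us * 1e-6) / 1e6;
            printf("%8zu bytes  %8.2f us  %10.2f MB/s\n", bytes, one_way_us, bw_mb_s);
        }
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```

In practice, a benchmark like this would be run across nodes of a cluster (e.g. under a SLURM allocation) and its output collected and plotted, which connects to the data-collection and visualization responsibilities above.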
Ways to stand out
- Practical experience with InfiniBand/Ethernet networks (RDMA, topologies, congestion control).
- Experience debugging network issues in large-scale deployments.
- Familiarity with CUDA programming and/or GPUs.
- Experience with Deep Learning frameworks such as PyTorch or TensorFlow.
Compensation and benefits
- Base salary ranges by level:
  - Level 3: 148,000 USD – 235,750 USD
  - Level 4: 184,000 USD – 287,500 USD
- Eligible for equity and benefits (link provided by employer).
Applications for this job will be accepted at least until October 22, 2025.
NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. The company does not discriminate on the basis of protected characteristics.