Senior HPC Performance Engineer

at NVIDIA
USD 148,000-287,500 per year
SENIOR
✅ On-site

Required Skills & Competences

Ansible @ 3, Docker @ 3, Kubernetes @ 3, Python @ 6, TensorFlow @ 4, Communication @ 6, Networking @ 4, Parallel Programming @ 6, Debugging @ 4, System Architecture @ 7, PyTorch @ 4, CUDA @ 3, GPU @ 4

Details

NVIDIA's GPU Communications Libraries and Networking team delivers libraries such as NCCL, NVSHMEM and UCX for Deep Learning and HPC. We are seeking a motivated performance engineer to influence the roadmap of our communication libraries and improve communication performance across multi-GPU and multi-node clusters. This role focuses on performance characterization, analysis, tooling, triage and collaboration across hardware and software stacks.

Responsibilities

  • Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
  • Study interactions of libraries with hardware (GPU, CPU, networking) and software components across the stack.
  • Evaluate proof-of-concepts and perform trade-off analysis for alternative solutions.
  • Triage and root-cause performance issues reported by customers.
  • Collect large volumes of performance data; build tools and infrastructure to visualize and analyze information.
  • Collaborate with a dynamic, cross-time-zone team.

Requirements

  • M.S. (or equivalent experience) or Ph.D. in Computer Science or a related field with relevant performance engineering and HPC experience.
  • 3+ years of experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM).
  • Experience conducting performance benchmarking and triage on large-scale HPC clusters.
  • Strong understanding of computer system architecture, hardware-software interactions and operating systems principles.
  • Ability to implement micro-benchmarks in C/C++ and read/modify existing code bases.
  • Ability to debug performance issues across the entire HW/SW stack.
  • Proficient in a scripting language, preferably Python.
  • Familiarity with containers, cloud provisioning and scheduling tools (Kubernetes, SLURM, Ansible, Docker).
  • Adaptability and willingness to learn new tools and areas; ability to work and communicate effectively across teams and time zones.

Ways to stand out

  • Practical experience with InfiniBand/Ethernet networks (RDMA, topologies, congestion control).
  • Experience debugging network issues in large-scale deployments.
  • Familiarity with CUDA programming and/or GPUs.
  • Experience with deep learning frameworks such as PyTorch or TensorFlow.

Compensation and benefits

  • Base salary ranges provided by location and level:
    • Level 3: 148,000 USD - 235,750 USD
    • Level 4: 184,000 USD - 287,500 USD
  • Eligible for equity and benefits.

Other information

  • Applications accepted at least until August 12, 2025.
  • NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.