Senior System Software Engineer, NCCL - Partner Enablement

at NVIDIA

📍 Santa Clara, United States

USD 152,000-287,500 per year

SENIOR

✅ On-site

Required Skills & Competences

System Administration @ 4, Ansible @ 3, Docker @ 3, Kubernetes @ 3, Linux @ 7, Python @ 7, GCP @ 4, Machine Learning @ 4, TensorFlow @ 4, AWS @ 4, Azure @ 4, Communication @ 4, Networking @ 4, Parallel Programming @ 4, Debugging @ 4, PyTorch @ 4, CUDA @ 3, GPU @ 4

Details

NVIDIA is leading developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. The GPU Communications Libraries and Networking team delivers communication runtimes like NCCL and NVSHMEM for Deep Learning and HPC applications. This role (Partner Enablement Engineer) guides key partners and customers with NCCL, working end-to-end on AI networking stacks across large GPU clusters and cloud platforms.

Responsibilities

Engage with partners and customers to root cause functional and performance issues reported with NCCL.
Conduct performance characterization and analysis of NCCL and deep learning (DL) applications on GPU clusters.
Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP).
Advise customers and support teams on HPC fundamentals and standard methodologies for running applications on multi-node clusters.
Write documentation and conduct trainings and webinars for NCCL.
Engage with internal teams across time zones on networking, GPUs, storage, infrastructure and support.

Requirements

B.S./M.S. in Computer Science or Computer Engineering (or equivalent experience), plus 5+ years of relevant experience.
Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM).
Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design.
Experience supporting HPC or AI engineering or academic research communities.
Practical experience with high-performance networking: InfiniBand, RoCE, Ethernet, RDMA, topologies, congestion control.
Strong Linux fundamentals and proficiency with a scripting language (preferably Python).
Familiarity with containers, cloud provisioning and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible).
Adaptability and passion to learn new areas and tools; ability to work across teams and time zones.

Ways to Stand Out

Experience conducting performance benchmarking and developing infrastructure on HPC clusters; prior system administration for large clusters.
Experience debugging network configuration issues in large-scale deployments.
Familiarity with CUDA programming and/or GPUs.
Good understanding of machine learning concepts and experience with deep learning frameworks such as PyTorch and TensorFlow.

Compensation & Benefits

Base salary ranges (determined by location and experience):
- Level 3: 152,000 USD - 218,500 USD
- Level 4: 184,000 USD - 287,500 USD
Eligible for equity and benefits (see NVIDIA benefits page).

Application Information

Applications accepted at least until January 18, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.