Senior Deep Learning Communication Architect

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Python @ 7, PyTorch @ 6, Algorithms @ 4, Hiring @ 4, Communication @ 4, LLM @ 4, Networking @ 3, CUDA @ 3, GPU @ 3

Details

NVIDIA's software architecture group is hiring a Deep Learning Communication Architect to scale DNN models and training/inference frameworks to systems with hundreds of thousands of nodes. The role focuses on optimizing communication performance, designing efficient communication protocols, and collaborating with hardware and software teams to apply high-speed interconnects and communication libraries for large-scale deep learning workloads. The team researches new communication technologies, builds proofs-of-concept, and performs quantitative modeling to validate and deploy new communication strategies.

Responsibilities

  • Identify and eliminate bottlenecks in data transfer and synchronization during distributed deep learning training and inference.
  • Design and implement communication algorithms and protocols tailored for deep learning workloads to minimize communication overhead and latency.
  • Collaborate with hardware and software teams to co-design systems using high-speed interconnects (e.g., NVLink, InfiniBand, Spectrum-X) and communication libraries (MPI, NCCL, UCX, UCC, NVSHMEM); a minimal all-reduce timing sketch follows this list.
  • Research and evaluate new communication technologies and techniques to improve performance and scalability of deep learning systems.
  • Build proofs-of-concept, run experiments, and perform quantitative modeling to validate and deploy new communication strategies.
  • Optimize LLM training and inference performance on cutting-edge hardware at large scale.
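
To make the communication focus of these responsibilities concrete, the sketch below (not part of the posting) times a single NCCL all-reduce through torch.distributed. The buffer size, warm-up count, script name, and torchrun launch are illustrative assumptions, not details from the role.

    # Minimal sketch: timing a single NCCL all-reduce via torch.distributed.
    # Assumes a multi-GPU node and a launch such as:
    #   torchrun --nproc_per_node=8 bench_allreduce.py   (script name is hypothetical)
    import os
    import time

    import torch
    import torch.distributed as dist

    def main():
        # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

        # A gradient-sized buffer (256 MiB of fp16) reduced across all ranks.
        tensor = torch.ones(128 * 1024 * 1024, dtype=torch.float16, device="cuda")

        # Warm up, then time one collective; the elapsed time bounds achievable bus bandwidth.
        for _ in range(5):
            dist.all_reduce(tensor)
        torch.cuda.synchronize()
        start = time.perf_counter()
        dist.all_reduce(tensor)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

        if dist.get_rank() == 0:
            gib = tensor.numel() * tensor.element_size() / 2**30
            print(f"all_reduce of {gib:.2f} GiB took {elapsed * 1e3:.2f} ms")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Measurements like this are the starting point for the bottleneck analysis and protocol tuning described above.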

Requirements

  • BS, MS, or PhD in Computer Science, Electrical Engineering, or a closely related field, or equivalent experience.
  • 6+ years of experience building and scaling DNNs, parallelizing DNN frameworks, or working on deep learning training and inference workloads.
  • Experience evaluating, analyzing, and optimizing LLM training and inference performance of state-of-the-art models on modern hardware.
  • Deep understanding of parallelism techniques: Data Parallelism, Pipeline Parallelism, Tensor Parallelism, Expert Parallelism, and FSDP; a minimal FSDP sketch follows this list.
  • Understanding of emerging serving architectures such as Disaggregated Serving and inference servers like Dynamo and Triton.
  • Proficiency developing code for one or more DNN training and inference frameworks (e.g., PyTorch, TensorRT-LLM, vLLM, SGLang).
  • Strong programming skills in C++ and Python.
  • Familiarity with GPU computing (CUDA, OpenCL) and networking technologies such as InfiniBand and RoCE.
  • Experience collaborating across hardware and software stacks and working with communication libraries (MPI, NCCL, UCX, UCC, NVSHMEM).
  • Ability to conduct experiments, perform quantitative modeling, and build proofs-of-concept to validate new approaches.
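
As a concrete illustration of one parallelism technique from the list above, the sketch below (not part of the posting) wraps a toy model in PyTorch FSDP. The model shape, learning rate, and launch setup are illustrative assumptions.

    # Minimal sketch: fully sharded data parallelism (FSDP) on a toy model.
    # Assumes a multi-GPU node launched with `torchrun --nproc_per_node=<N>`.
    import os

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Toy model; FSDP shards its parameters across ranks and gathers them on demand,
    # trading extra all-gather/reduce-scatter traffic for per-GPU memory savings.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
    model = FSDP(model)

    # Create the optimizer after wrapping so it sees the sharded parameters.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    inputs = torch.randn(8, 1024, device="cuda")

    loss = model(inputs).pow(2).mean()
    loss.backward()   # gradients are reduce-scattered across ranks
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()

The extra all-gather and reduce-scatter traffic FSDP introduces is exactly the kind of communication cost this role is expected to analyze and minimize.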

Ways to Stand Out

  • Prior contributions to one or more DNN training and inference frameworks.
  • Deep understanding of, and contributions to, scaling LLMs on large-scale systems.

Compensation & Benefits

  • Base salary ranges provided by location and level: Level 4: 184,000 USD - 287,500 USD; Level 5: 224,000 USD - 356,500 USD.
  • Eligible for equity and benefits (see NVIDIA benefits).

Other Details

  • Location: Santa Clara, CA, United States.
  • Employment type: Full time.
  • Applications accepted at least until July 29, 2025.
  • NVIDIA is an equal opportunity employer committed to diversity and inclusion.