Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — expert. You can teach others and know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Python @ 4
Communication @ 4
Parallel Programming @ 7
System Architecture @ 7
LLM @ 4
PyTorch @ 4
CUDA @ 4
GPU @ 4
Details
NVIDIA is driving advances in Artificial Intelligence, High Performance Computing and Visualization. The GPU is at the heart of our products and services. We are looking for a motivated Deep Learning engineer to bring advanced communication technologies into AI stacks (including PyTorch, TRT-LLM, vLLM, SGLang, JAX, etc.). You will work with the team that created communication libraries like NCCL and NVSHMEM and technologies like GPUDirect to scale Deep Learning and HPC applications across multi-GPU systems.
Responsibilities
- Integrate new communication library features into AI frameworks: from proof-of-concept to performance analysis to production.
- Perform deep analysis of AI workloads and frameworks to identify multi-GPU communication requirements and opportunities; collaborate hands-on with teams working on the latest AI models.
- Improve AI compilers to hide communications or perform automatic fusion.
- Conduct in-depth AI workload performance characterization on multi-GPU clusters.
- Design fault-tolerant and elastic solutions for large-scale or dynamic AI workloads.
- Author custom communication or fused compute-communication kernels to demonstrate performance on NVIDIA platforms.
- Influence the roadmap of communication libraries such as NCCL and NVSHMEM.
- Collaborate with a dynamic, distributed team across multiple time zones.
Requirements
- B.S., M.S., or Ph.D. in Computer Science or a related field (or equivalent experience) with 5+ years of software engineering and HPC/AI experience.
- Development or integration experience with deep learning frameworks such as PyTorch and JAX, and inference engines such as TRT-LLM, vLLM, SGLang.
- Rapid prototyping and development experience with Python, C++, CUDA or related DSLs (Triton, cuTe).
- Solid understanding of AI models, parallelism strategies, and/or compiler technologies (for example, torch.compile).
- Experience conducting performance benchmarking on AI clusters and familiarity with at least one profiler toolchain (PyTorch profiler, NVIDIA Nsight Systems).
- Understanding of HPC/AI communication concepts (one-sided vs two-sided communication, elasticity, resiliency, topology discovery, etc.).
- Adaptability and willingness to learn new areas and tools; flexibility to work and communicate across teams and time zones.
Ways to stand out
- Experience with parallel programming on at least one communication runtime (NCCL, NVSHMEM, MPI) and strong systems-software fundamentals (computer system architecture, HW-SW interactions, operating systems principles).
- Expertise in one or more areas: distributed training, distributed inference, MoE, reinforcement learning, kernel authoring (CUDA, Triton, cuTe), and programming for compute & communication overlap in distributed runtimes.
- Experience with AI compiler pattern matching and lowering; solid understanding of memory hierarchy, consistency models, and tensor layout.
Compensation & Benefits
- Base salary ranges by level:
  - Level 3: 152,000 USD - 241,500 USD
  - Level 4: 184,000 USD - 287,500 USD
- You will also be eligible for equity and benefits (see NVIDIA benefits page).
Additional information
- Applications for this job will be accepted at least until January 26, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.