Senior Deep Learning Framework Communications Engineer

at NVIDIA
USD 152,000-287,500 per year
SENIOR
On-site

Required Skills & Competences

Python @ 4, Communication @ 4, Parallel Programming @ 7, System Architecture @ 7, LLM @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4

Details

NVIDIA is driving advances in artificial intelligence, high-performance computing, and visualization. The GPU is at the heart of our products and services. We are looking for a motivated deep learning engineer to bring advanced communication technologies into AI stacks such as PyTorch, TRT-LLM, vLLM, SGLang, and JAX. You will work with the team that created communication libraries like NCCL and NVSHMEM and technologies like GPUDirect to scale deep learning and HPC applications across multi-GPU systems.

Responsibilities

  • Integrate new communication library features into AI frameworks: from proof-of-concept to performance analysis to production.
  • Perform deep analysis of AI workloads and frameworks to identify multi-GPU communication requirements and opportunities; collaborate hands-on with teams working on the latest AI models.
  • Improve AI compilers to hide communications or perform automatic fusion.
  • Conduct in-depth AI workload performance characterization on multi-GPU clusters.
  • Design fault-tolerant and elastic solutions for large-scale or dynamic AI workloads.
  • Author custom communication or fused compute-communication kernels to demonstrate performance on NVIDIA platforms.
  • Influence the roadmap of communication libraries such as NCCL and NVSHMEM.
  • Collaborate with a dynamic, distributed team across multiple time zones.

Requirements

  • B.S., M.S., or Ph.D. in Computer Science or a related field (or equivalent experience) with 5+ years of software engineering and HPC/AI experience.
  • Development or integration experience with deep learning frameworks such as PyTorch and JAX, and inference engines such as TRT-LLM, vLLM, and SGLang.
  • Rapid prototyping and development experience with Python, C++, CUDA, or related DSLs (Triton, CuTe).
  • Solid understanding of AI models, parallelism strategies, and/or compiler technologies (for example, torch.compile).
  • Experience conducting performance benchmarking on AI clusters and familiarity with at least one profiler toolchain (PyTorch Profiler, NVIDIA Nsight Systems).
  • Understanding of HPC/AI communication concepts (one-sided vs two-sided communication, elasticity, resiliency, topology discovery, etc.).
  • Adaptability and willingness to learn new areas and tools; flexibility to work and communicate across teams and time zones.

Ways to stand out

  • Experience with parallel programming on at least one communication runtime (NCCL, NVSHMEM, MPI) and strong systems-software fundamentals (computer system architecture, HW-SW interactions, operating system principles).
  • Expertise in one or more areas: distributed training, distributed inference, MoE, reinforcement learning, kernel authoring (CUDA, Triton, CuTe), and programming for compute-communication overlap in distributed runtimes.
  • Experience with AI compiler pattern matching and lowering; solid understanding of memory hierarchy, consistency models, and tensor layout.

Compensation & Benefits

  • Base salary ranges by level:
    • Level 3: 152,000 USD - 241,500 USD
    • Level 4: 184,000 USD - 287,500 USD
  • You will also be eligible for equity and benefits (see NVIDIA benefits page).

Additional information

  • Applications for this job will be accepted at least until January 26, 2026.
  • This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.