Senior Deep Learning Frameworks CUDA Software Engineer

at NVIDIA

📍 Santa Clara, United States

USD 224,000-356,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The proficiency levels are:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.

Python @ 4, Machine Learning @ 4, Communication @ 4, System Architecture @ 4, LLM @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4, Deep Learning @ 4, AI @ 4, Reinforcement Learning @ 4, vLLM @ 4, NCCL @ 4, TensorRT @ 4, SGLang @ 4, HPC @ 4, Performance Analysis @ 4, JAX @ 4

Details

NVIDIA is looking for a motivated Deep Learning engineer to bring advanced CUDA features and distributed runtime technologies into AI stacks (PyTorch, TRT-LLM, vLLM, SGLang, JAX, etc.). You will work with teams that created core CUDA features and runtimes for scaling Deep Learning and HPC applications, addressing multi-GPU demands from training at very large scale to low-latency inference.

Responsibilities

Integrate new CUDA features and runtime abstractions into AI frameworks: from proof-of-concept to performance analysis to production.
Perform deep analysis of AI workloads and frameworks to identify requirements and opportunities to innovate in the lower layers of the stack; collaborate hands-on with teams working on the latest AI models.
Own and drive improvements in the AI compiler-runtime interface to build high-performance multi-GPU, multi-node solutions.
Design fault-tolerant and elastic solutions for large-scale or dynamic AI workloads.
Influence the roadmap of core CUDA to facilitate next-generation deep learning frameworks.
Collaborate across multiple time zones with AI researchers, hardware and software architects, kernel and compiler authors, and CUDA driver experts to co-design systems and frameworks that enhance performance and programmability.
Develop exploratory tools and runtime systems to profile and accelerate new paradigms in deep learning.
Write clean, effective, and maintainable code so prototypes can transition into open-source releases, upstream framework integrations, internal tools, or closed-source commercial products.

Requirements

BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or related field (or equivalent experience).
8+ years of relevant industry experience, or equivalent academic experience, after completing the degree.
Development experience with deep learning frameworks such as PyTorch and JAX, and inference engines such as TRT-LLM, vLLM, SGLang.
Rapid prototyping and development experience with Python, C++, CUDA or related domain-specific languages.
Solid grasp of AI models, parallelism strategies, and/or compiler technologies (e.g., torch.compile).
Experience conducting performance benchmarking on AI clusters and familiarity with at least one performance profiler toolchain (PyTorch profiler, NVIDIA Nsight Systems).
Understanding of HPC/AI communication concepts and communication libraries.
Good understanding of computer system architecture, HW–SW interactions, and operating system principles (systems software fundamentals).
Adaptability and passion to learn new frameworks and tools, and flexibility to work and communicate effectively across different teams and time zones.

Ways to stand out

Deep expertise in the performance internals and execution graphs of major deep learning autograd, training, and inference frameworks (PyTorch, JAX, TensorRT, vLLM, SGLang, NeMo, Megatron, etc.).
Hands-on experience with CUDA, communication libraries (NCCL, MPI, UCX) and distributed machine learning techniques (pipeline parallelism, tensor parallelism).
Expertise in areas such as training, distributed inference, Mixture of Experts (MoE), reinforcement learning, or kernel authoring (CUDA, Triton, CuTe).
Background in deep learning compilers, both graph-level and codegen (Triton, XLA, torch.compile).
Experience programming for compute & communication overlap in distributed runtimes.

Compensation & Benefits

Base salary ranges published in the posting:
- Level 4: 184,000 USD - 287,500 USD
- Level 5: 224,000 USD - 356,500 USD
You will also be eligible for equity and benefits (link to NVIDIA benefits referenced in the original posting).
Applications accepted at least until May 18, 2026.

Company

NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. The posting includes standard non-discrimination statements.