Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Python @ 4
Machine Learning @ 4
Communication @ 4
System Architecture @ 4
LLM @ 4
PyTorch @ 4
CUDA @ 4
GPU @ 4
Deep Learning @ 4
AI @ 4
Reinforcement Learning @ 4
vLLM @ 4
NCCL @ 4
TensorRT @ 4
SGLang @ 4
HPC @ 4
Performance Analysis @ 4
JAX @ 4
Details
NVIDIA is looking for a motivated Deep Learning engineer to bring advanced CUDA features and distributed runtime technologies into AI stacks (PyTorch, TRT-LLM, vLLM, SGLang, JAX, etc.). You will work with teams that created core CUDA features and runtimes for scaling Deep Learning and HPC applications, addressing multi-GPU demands from training at very large scale to low-latency inference.
Responsibilities
- Integrate new CUDA features and runtime abstractions into AI frameworks: from proof-of-concept to performance analysis to production.
- Perform deep analysis of AI workloads and frameworks to identify requirements and opportunities to innovate in the lower layers of the stack; collaborate hands-on with teams working on the latest AI models.
- Own and drive improvements in the AI compiler-runtime interface to build high-performance multi-GPU, multi-node solutions.
- Design fault-tolerant and elastic solutions for large-scale or dynamic AI workloads.
- Influence the roadmap of core CUDA to facilitate next-generation deep learning frameworks.
- Collaborate across multiple time zones with AI researchers, hardware and software architects, kernel and compiler authors, and CUDA driver experts to co-design systems and frameworks that enhance performance and programmability.
- Develop exploratory tools and runtime systems to profile and accelerate new paradigms in deep learning.
- Write clean, effective, and maintainable code so prototypes can transition into open-source releases, upstream framework integrations, internal tools, or closed-source commercial products.
Requirements
- BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or related field (or equivalent experience).
- 8+ years of relevant industry experience, or equivalent academic experience after completing the degree.
- Development experience with deep learning frameworks such as PyTorch and JAX, and inference engines such as TRT-LLM, vLLM, SGLang.
- Rapid prototyping and development experience with Python, C++, CUDA or related domain-specific languages.
- Solid grasp of AI models, parallelism techniques, and/or compiler technologies (e.g., torch.compile).
- Experience conducting performance benchmarking on AI clusters and familiarity with at least one performance profiler toolchain (PyTorch profiler, NVIDIA Nsight Systems).
- Understanding of HPC/AI communication concepts and communication libraries.
- Good understanding of computer system architecture, HW–SW interactions, and operating system principles (systems software fundamentals).
- Adaptability and passion to learn new frameworks and tools, and flexibility to work and communicate effectively across different teams and time zones.
Ways to stand out
- Deep expertise in performance internals and execution graphs of major deep learning autograd, training, and inference frameworks (PyTorch, JAX, TensorRT, vLLM, SGLang, NeMo, Megatron, etc.).
- Hands-on experience with CUDA, communication libraries (NCCL, MPI, UCX) and distributed machine learning techniques (pipeline parallelism, tensor parallelism).
- Expertise in areas such as training, distributed inference, Mixture of Experts (MoE), reinforcement learning, or kernel authoring (CUDA, Triton, CuTe).
- Background in deep learning compilers, both graph-level and codegen (Triton, XLA, torch.compile).
- Experience programming for compute & communication overlap in distributed runtimes.
Compensation & Benefits
- Base salary ranges published in the posting:
  - Level 4: 184,000 USD - 287,500 USD
  - Level 5: 224,000 USD - 356,500 USD
- You will also be eligible for equity and benefits (link to NVIDIA benefits referenced in the original posting).
- Applications accepted at least until May 18, 2026.
Company
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. The posting includes standard non-discrimination statements.