Senior DL Software Engineer, Model Optimization and Edge Deployment - Autonomous Vehicles
at NVIDIA
USD 184,000-356,500 per year
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Algorithms @ 4
Machine Learning @ 6
LLM @ 4
PyTorch @ 4
CUDA @ 3
GPU @ 7
Deep Learning @ 4
AI @ 7
Profiling @ 4
Robotics @ 4
vLLM @ 6
TensorRT @ 4
SGLang @ 6
JAX @ 6
Details
NVIDIA is seeking a high-caliber Deep Learning Engineer to bridge cutting-edge multimodal architectures and real-time robotic execution for autonomous vehicles. In this role, you will design and implement state-of-the-art algorithms that make LLMs and VLMs fast, lean, and reliable enough to power an end-to-end driving stack. You will re-architect models for the edge to meet the strict latency and safety constraints of an AV compute platform, and integrate large-scale models into a high-performance C++ production environment.
Responsibilities
- Develop state-of-the-art model optimization techniques (examples: speculative decoding with block diffusion, KV cache streaming, Prefill–Decode separation) to boost end-to-end model performance for production deployments.
- Implement advanced compression techniques including quantization (FP4/FP8), pruning, and knowledge distillation to minimize model footprints while preserving safety-critical accuracy.
- Design high-performance inference optimizations, including automated model sharding (tensor/sequence parallelism) and efficient attention kernels optimized for KV-caching.
- Conduct deep, layer-by-layer model profiling to identify compute and memory bottlenecks and drive targeted optimizations for real-time execution.
- Leverage the PyTorch ecosystem to extract standardized model graph representations and automate deployment pipelines for TensorRT conversion.
- Scale deep learning model performance across NVIDIA edge architectures to maximize the on-road throughput of specialized accelerators.
- Architect software interfaces to integrate and interact with large-scale models within a high-performance C++ production environment.
- Partner with research, TensorRT, and Cosmos teams to translate innovations into shipping product solutions.
Requirements
- PhD, MS, or BS (or equivalent experience) in Computer Science, Computer Engineering, or a related technical field, with 4+, 6+, or 8+ years of relevant experience respectively.
- Expert-level proficiency in PyTorch, JAX, or similar machine learning frameworks.
- Advanced proficiency with modern LLM/VLM inference stacks such as vLLM, TensorRT-LLM, and SGLang.
- Proven track record of training, deploying, or optimizing large-scale deep learning models in production environments.
- Deep familiarity with NVIDIA deep learning SDKs, specifically TensorRT and CUDA.
- Strong understanding of GPU architecture and the compilation stack, with the ability to debug end-to-end performance across the hardware/software boundary.
Ways to Stand Out
- Deep experience with LLM, VLM, and VLA model optimization tailored for real-time robotic control, embodied AI, and autonomous decision-making.
- Proven track record of implementing low-bit inference.
- Prior experience writing custom high-performance kernels using CUDA, Triton, or CUTLASS to accelerate non-standard layers and specialized attention mechanisms.
- Active contributions to open-source inference and optimization libraries such as vLLM, SGLang, and TensorRT-LLM.
- Thorough understanding of real-time robotics constraints including safety-critical determinism, hardware-in-the-loop (HIL) testing, and ultra-low latency requirements.
Compensation & Benefits
- Base salary range:
  - Level 4: 184,000 USD - 287,500 USD
  - Level 5: 224,000 USD - 356,500 USD
- Eligible for equity and benefits.
Additional Information
- Applications accepted at least until April 25, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer committed to a diverse work environment.