Senior DL Software Engineer, Model Optimization and Edge Deployment - Autonomous Vehicles

at NVIDIA

📍 Santa Clara, United States

USD 184,000-356,500 per year

SENIOR

✅ On-site

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.

- Algorithms @ 4
- Machine Learning @ 6
- LLM @ 4
- PyTorch @ 4
- CUDA @ 3
- GPU @ 7
- Deep Learning @ 4
- AI @ 7
- Profiling @ 4
- Robotics @ 4
- vLLM @ 6
- TensorRT @ 4
- SGLang @ 6
- JAX @ 6

Details

NVIDIA is seeking a high-caliber Deep Learning Engineer to bridge cutting-edge multimodal architectures and real-time robotic execution for autonomous vehicles. In this role, you will design and implement state-of-the-art algorithms that make LLMs and VLMs fast, lean, and reliable enough to power an end-to-end driving stack. You will re-architect models for the edge to meet the strict latency and safety constraints of an AV compute platform, and you will integrate large-scale models into a high-performance C++ production environment.

Responsibilities

Develop state-of-the-art model optimization techniques (for example, speculative decoding with block diffusion, KV cache streaming, and prefill-decode separation) to boost end-to-end model performance for production deployments.
Implement advanced compression techniques including quantization (FP4/FP8), pruning, and knowledge distillation to minimize model footprints while preserving safety-critical accuracy.
Design high-performance inference optimizations, including automated model sharding (tensor/sequence parallelism) and efficient attention kernels optimized for KV-caching.
Conduct deep, layer-by-layer model profiling to identify compute and memory bottlenecks and drive targeted optimizations for real-time execution.
Leverage the PyTorch ecosystem to extract standardized model graph representations and automate deployment pipelines for TensorRT conversion.
Scale deep learning model performance across NVIDIA edge architectures to maximize the throughput of specialized accelerators during on-road operation.
Architect software interfaces to integrate and interact with large-scale models within a high-performance C++ production environment.
Partner with research, TensorRT, and Cosmos teams to translate innovations into shipping product solutions.

Requirements

PhD, MS, or BS (or equivalent experience) in Computer Science, Computer Engineering, or a related technical field, with 4+ years (PhD), 6+ years (MS), or 8+ years (BS) of relevant experience.
Expert-level proficiency in PyTorch, JAX, or similar machine learning frameworks.
Advanced proficiency with modern LLM/VLM inference stacks such as vLLM, TensorRT-LLM, and SGLang.
Proven track record of training, deploying, or optimizing large-scale deep learning models in production environments.
Deep familiarity with NVIDIA deep learning SDKs, specifically TensorRT and CUDA.
Strong understanding of GPU architecture and the compilation stack, with the ability to debug end-to-end performance across the hardware/software boundary.

Ways to Stand Out

Deep experience with LLM, VLM, and VLA model optimization tailored for real-time robotic control, embodied AI, and autonomous decision-making.
Proven track record of implementing low-bit inference.
Prior experience writing custom high-performance kernels using CUDA, Triton, or CUTLASS to accelerate non-standard layers and specialized attention mechanisms.
Active contributions to open-source inference and optimization libraries such as vLLM, SGLang, and TensorRT-LLM.
Thorough understanding of real-time robotics constraints including safety-critical determinism, hardware-in-the-loop (HIL) testing, and ultra-low latency requirements.

Compensation & Benefits

Base salary range:
- Level 4: 184,000 USD - 287,500 USD
- Level 5: 224,000 USD - 356,500 USD
Eligible for equity and benefits.

Additional Information

Applications accepted at least until April 25, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer committed to a diverse work environment.