Used Tools & Technologies
Not specified
Required Skills & Competences
Kubernetes @ 7, Python @ 6, Networking @ 7, Distributed Systems @ 4, Leadership @ 4, Performance Monitoring @ 4, Technical Leadership @ 4, Experimentation @ 4, PyTorch @ 4, GPU @ 4

Details
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today, NVIDIA is applying AI to define the next era of computing — where GPUs act as the brains of computers, robots, and self-driving cars that can understand the world. The team is seeking a Senior Deep Learning Systems Engineer to build and scale training libraries and infrastructure for end-to-end autonomous driving models, enabling training on thousands of GPUs and massive datasets.
Responsibilities
- Craft, scale, and harden deep learning infrastructure libraries and frameworks for training on multi-thousand GPU clusters.
- Improve efficiency across the training stack: data loaders, distributed training, scheduling, and performance monitoring.
- Build robust training pipelines and libraries to handle massive video datasets and enable rapid experimentation.
- Collaborate with researchers, model engineers, and internal platform teams to enhance efficiency, minimize stalls, and improve training availability.
- Own core infrastructure components such as orchestration libraries, distributed training frameworks, and fault-resilient training systems.
- Partner with leadership to ensure infrastructure scales with growing GPU capacity and dataset size while maintaining developer efficiency and stability.
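The fault-resilience themes running through these responsibilities (minimizing stalls, improving training availability, fault-resilient training systems) all rest on one primitive: atomically written, resumable checkpoints. Below is a minimal sketch of that skeleton; it assumes nothing about NVIDIA's actual stack, and the `train` function, JSON format, and `every` parameter are purely illustrative.

```python
import json
import os

def train(total_steps, ckpt_path, every=10):
    """Toy fault-resilient training loop: resume from the newest checkpoint.

    `state` stands in for model/optimizer state; a real system would shard
    and checkpoint tensors, but the resume-plus-atomic-write skeleton is the
    same idea.
    """
    step, state = 0, 0
    if os.path.exists(ckpt_path):  # a prior run left a checkpoint: resume it
        with open(ckpt_path) as f:
            saved = json.load(f)
        step, state = saved["step"], saved["state"]
    while step < total_steps:
        state += step  # stand-in for one optimizer step
        step += 1
        if step % every == 0 or step == total_steps:
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"step": step, "state": state}, f)
            os.replace(tmp, ckpt_path)  # atomic rename: never a torn checkpoint
    return state
```

Writing to a temp file and renaming means a crash mid-write can never corrupt the last good checkpoint, which is what keeps restarts cheap on large clusters.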
Requirements
- BS, MS, or PhD in Computer Science, Electrical/Computer Engineering, or a related field, or equivalent experience.
- 12+ years of professional experience building and scaling high-performance distributed systems, ideally in ML, HPC, or large-scale data infrastructure.
- Extensive knowledge of deep learning frameworks (PyTorch preferred) and large-scale training technologies (DDP/FSDP, NCCL, tensor and pipeline parallelism).
- Experience with performance profiling and optimizing training stacks.
- Strong systems background including datacenter networking (RoCE, InfiniBand), parallel filesystems (Lustre), storage systems, and schedulers (Slurm, Kubernetes).
- Proficiency in Python and C++, with experience writing production-grade libraries, orchestration layers, and automation tools.
- Ability to work closely with cross-functional teams (ML researchers, infra engineers, product leads) and translate requirements into robust systems.
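Several of these requirements center on collective communication (NCCL, DDP/FSDP, datacenter fabrics). For illustration only, here is a pure-Python sketch of the ring all-reduce pattern that such libraries implement in hardware-optimized form; the function name and list-of-lists representation are assumptions of this sketch, not any real API.

```python
def ring_allreduce(buffers):
    """Sum-all-reduce over n equal-length worker buffers via the ring algorithm.

    Pure-Python simulation: each "worker" is a list, each "send" a slice copy.
    Real collectives run the same two phases over NVLink or InfiniBand links.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer length must divide evenly into n chunks"
    chunk = size // n
    # Split every worker's buffer into n chunks (working copies).
    data = [[list(buf[c * chunk:(c + 1) * chunk]) for c in range(n)]
            for buf in buffers]

    # Phase 1 (reduce-scatter): in step t, worker i sends chunk (i - t) mod n
    # to its right neighbor, which accumulates it. After n - 1 steps, worker i
    # holds the fully reduced chunk (i + 1) mod n.
    for t in range(n - 1):
        for i in range(n):
            c = (i - t) % n
            dst = (i + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], data[i][c])]

    # Phase 2 (all-gather): circulate the reduced chunks the same way, but
    # overwrite instead of accumulate, until every worker holds every chunk.
    for t in range(n - 1):
        for i in range(n):
            c = (i + 1 - t) % n
            data[(i + 1) % n][c] = list(data[i][c])

    return [[v for ch in worker for v in ch] for worker in data]
```

Each worker transmits roughly 2 * (n - 1) / n of the buffer regardless of ring size, which is why this pattern remains bandwidth-optimal on multi-thousand GPU clusters.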
Ways to stand out
- Experience scaling large GPU training clusters with >1,000 GPUs.
- Contributions to open-source ML systems libraries (e.g., PyTorch, NCCL, FSDP, schedulers, storage clients).
- Expertise in fault resilience and high availability, including elastic training and large-scale observability.
- Hands-on technical leadership experience, establishing guidelines for ML systems engineering.
- Familiarity with reinforcement learning at scale, especially for simulation-heavy workloads.
Compensation & Benefits
- Base salary range: 224,000 USD - 356,500 USD (determined based on location, experience, and internal pay equity).
- Eligible for equity and additional benefits (see NVIDIA benefits).
Other details
- Location: Santa Clara, CA, United States.
- Employment type: Full time.
- Applications accepted at least until October 6, 2025.
- NVIDIA is an equal opportunity employer and fosters a diverse work environment.
#deeplearning