Senior Deep Learning Engineer – Autonomous Vehicles

at NVIDIA
USD 224,000-356,500 per year
SENIOR
On-site

Tools & Technologies Used

Machine Learning

Required Skills & Competences

Kubernetes @ 7, Python @ 6, Distributed Systems @ 4, Leadership @ 4, Networking @ 7, Performance Monitoring @ 4, Experimentation @ 4, PyTorch @ 4, GPU @ 4, Deep Learning @ 4, Observability @ 4, AI @ 4, InfiniBand @ 7, Reinforcement Learning @ 3, Profiling @ 4, NCCL @ 4, Slurm @ 7, HPC @ 8

Details

NVIDIA is seeking a Senior Deep Learning Systems Engineer to advance the Autonomous Vehicles project by building and scaling training libraries and infrastructure for end-to-end autonomous driving models. The role focuses on enabling training on multi-thousand-GPU clusters and on improving iteration speed, safety, and developer productivity through robust, high-performance infrastructure.

Responsibilities

  • Design, scale, and harden deep learning infrastructure libraries and frameworks for training on multi-thousand-GPU clusters.
  • Improve efficiency across the training stack: data loaders, distributed training, scheduling, and performance monitoring.
  • Build robust training pipelines and libraries to handle massive video datasets and enable rapid experimentation.
  • Collaborate with researchers, model engineers, and internal platform teams to enhance efficiency, minimize stalls, and improve training availability.
  • Own core infrastructure components such as orchestration libraries, distributed training frameworks, and fault-resilient training systems.
  • Partner with leadership to ensure infrastructure scales with growing GPU capacity and dataset size while maintaining developer efficiency and stability.

Requirements

  • BS, MS, or PhD in Computer Science, Electrical/Computer Engineering, or a related field, or equivalent experience.
  • 12+ years of professional experience building and scaling high-performance distributed systems, ideally in ML, HPC, or large-scale data infrastructure.
  • Extensive knowledge of deep learning frameworks (PyTorch preferred) and large-scale training (DDP/FSDP, NCCL, tensor and pipeline parallelism).
  • Strong systems background including datacenter networking (RoCE, InfiniBand), parallel filesystems (Lustre), storage systems, and schedulers (Slurm, Kubernetes).
  • Proficiency in Python and C++, with experience writing production-grade libraries, orchestration layers, and automation tools.
  • Experience with performance profiling and optimizing large-scale training workflows.
  • Ability to work closely with cross-functional teams (ML researchers, infra engineers, product leads) and translate requirements into robust systems.

Ways to stand out

  • Experience scaling GPU training clusters beyond 1,000 GPUs.
  • Contributions to open-source ML systems libraries (e.g., PyTorch, NCCL, FSDP, schedulers, storage clients).
  • Expertise in fault resilience and high availability, including elastic training and large-scale observability.
  • Hands-on leadership experience as a technical authority for ML systems engineering.
  • Familiarity with reinforcement learning at scale, particularly for simulation-heavy workloads.

Compensation & Benefits

  • Base salary range: USD 224,000-356,500 per year (the base is determined by location, experience, and the pay of employees in similar positions).
  • Eligible for equity and benefits (see NVIDIA benefits page).

Other details

  • Applications are accepted until at least July 3, 2026. This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer and committed to fostering an inclusive work environment.