Senior Deep Learning Engineer – Autonomous Vehicles

at NVIDIA
USD 224,000-356,500 per year
SENIOR
✅ On-site

Required Skills & Competences

  • Kubernetes @ 7
  • Python @ 6
  • Distributed Systems @ 4
  • Leadership @ 4
  • Networking @ 7
  • Performance Monitoring @ 4
  • Technical Leadership @ 4
  • Experimentation @ 4
  • PyTorch @ 4
  • GPU @ 4

Details

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today, NVIDIA is applying AI to define the next era of computing — where GPUs act as the brains of computers, robots, and self-driving cars that can understand the world. The team is seeking a Senior Deep Learning Systems Engineer to build and scale training libraries and infrastructure for end-to-end autonomous driving models, enabling training on thousands of GPUs and massive datasets.

Responsibilities

  • Craft, scale, and harden deep learning infrastructure libraries and frameworks for training on multi-thousand GPU clusters.
  • Improve efficiency across the training stack: data loaders, distributed training, scheduling, and performance monitoring.
  • Build robust training pipelines and libraries to handle massive video datasets and enable rapid experimentation.
  • Collaborate with researchers, model engineers, and internal platform teams to enhance efficiency, minimize stalls, and improve training availability.
  • Own core infrastructure components such as orchestration libraries, distributed training frameworks, and fault-resilient training systems (a minimal illustration follows this list).
  • Partner with leadership to ensure infrastructure scales with growing GPU capacity and dataset size while maintaining developer efficiency and stability.
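
As a rough, hedged illustration of the fault-resilient training mentioned above (not NVIDIA's actual implementation), the sketch below shows the core idea in plain PyTorch: periodically checkpoint model and optimizer state so a preempted or failed job resumes from the last saved step instead of restarting from scratch. The checkpoint path, model, and checkpoint cadence are placeholders.

```python
# Minimal checkpoint/resume loop (illustrative only; path, model, and
# checkpoint cadence are placeholders, not a production system).
import os
import torch

CKPT_PATH = "checkpoint.pt"  # real systems would write to shared, durable storage

def save_checkpoint(step, model, optimizer):
    # Persist everything needed to continue training exactly where we stopped.
    torch.save(
        {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    # Resume from the last checkpoint if one exists; otherwise start at step 0.
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1

model = torch.nn.Linear(16, 1)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
start_step = load_checkpoint(model, optimizer)

for step in range(start_step, 100):
    x = torch.randn(8, 16)                          # placeholder batch
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 10 == 0:
        save_checkpoint(step, model, optimizer)     # periodic checkpoint
```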

Requirements

  • BS, MS, or PhD in Computer Science, Electrical/Computer Engineering, or a related field, or equivalent experience.
  • 12+ years of professional experience building and scaling high-performance distributed systems, ideally in ML, HPC, or large-scale data infrastructure.
  • Extensive knowledge of deep learning frameworks (PyTorch preferred) and large-scale training technologies (DDP/FSDP, NCCL, tensor and pipeline parallelism); a minimal DDP sketch follows this list.
  • Experience with performance profiling and optimizing training stacks.
  • Strong systems background including datacenter networking (RoCE, InfiniBand), parallel filesystems (Lustre), storage systems, and schedulers (Slurm, Kubernetes).
  • Proficiency in Python and C++, with experience writing production-grade libraries, orchestration layers, and automation tools.
  • Ability to work closely with cross-functional teams (ML researchers, infra engineers, product leads) and translate requirements into robust systems.
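
For readers unfamiliar with the distributed-training stack named in the requirements (PyTorch DDP over NCCL, launched under a scheduler such as Slurm or Kubernetes), the hedged sketch below shows a minimal single-node setup. The model, dataset, hyperparameters, and script name are placeholder assumptions; a real multi-thousand-GPU job would add FSDP or tensor/pipeline parallelism, elastic restarts, and monitoring.

```python
# Minimal PyTorch DDP example over NCCL (illustrative sketch; model, data,
# and hyperparameters are placeholders). Launch with:
#   torchrun --nproc_per_node=<num_gpus> ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")        # torchrun provides rank/world size
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)          # shards the dataset across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                   # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                        # gradients all-reduced via NCCL
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```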

Ways to stand out

  • Experience scaling GPU training clusters to more than 1,000 GPUs.
  • Contributions to open-source ML systems libraries (e.g., PyTorch, NCCL, FSDP, schedulers, storage clients).
  • Expertise in fault resilience and high availability, including elastic training and large-scale observability.
  • Hands-on technical leadership experience, establishing guidelines for ML systems engineering.
  • Familiarity with reinforcement learning at scale, especially for simulation-heavy workloads.

Compensation & Benefits

  • Base salary range: 224,000-356,500 USD, determined by location, experience, and internal pay equity.
  • Eligible for equity and additional benefits (see NVIDIA benefits).

Other details

  • Location: Santa Clara, CA, United States.
  • Employment type: Full time.
  • Applications accepted at least until October 6, 2025.
  • NVIDIA is an equal opportunity employer and fosters a diverse work environment.

#deeplearning