Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — expert. Able to teach others, with knowledge of the technology's pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Kubernetes @ 7
Python @ 6
Distributed Systems @ 4
Leadership @ 4
Networking @ 7
Performance Monitoring @ 4
Experimentation @ 4
PyTorch @ 4
GPU @ 4
Deep Learning @ 4
Observability @ 4
AI @ 4
InfiniBand @ 7
Reinforcement Learning @ 3
Profiling @ 4
NCCL @ 4
Slurm @ 7
HPC @ 8
Details
NVIDIA is seeking a Senior Deep Learning Systems Engineer to advance the Autonomous Vehicles project by building and scaling training libraries and infrastructure for end-to-end autonomous driving models. The role focuses on enabling training on multi-thousand GPU clusters and improving iteration speed, safety, and developer productivity through robust, high-performance infrastructure.
Responsibilities
- Craft, scale, and harden deep learning infrastructure libraries and frameworks for training on multi-thousand GPU clusters.
- Improve efficiency across the training stack: data loaders, distributed training, scheduling, and performance monitoring.
- Build robust training pipelines and libraries to handle massive video datasets and enable rapid experimentation.
- Collaborate with researchers, model engineers, and internal platform teams to enhance efficiency, minimize stalls, and improve training availability.
- Own core infrastructure components such as orchestration libraries, distributed training frameworks, and fault-resilient training systems.
- Partner with leadership to ensure infrastructure scales with growing GPU capacity and dataset size while maintaining developer efficiency and stability.
Requirements
- BS, MS, or PhD in Computer Science, Electrical/Computer Engineering, or a related field, or equivalent experience.
- 12+ years of professional experience building and scaling high-performance distributed systems, ideally in ML, HPC, or large-scale data infrastructure.
- Extensive knowledge of deep learning frameworks (PyTorch preferred) and large-scale training (DDP/FSDP, NCCL, tensor and pipeline parallelism).
- Strong systems background including datacenter networking (RoCE, InfiniBand), parallel filesystems (Lustre), storage systems, and schedulers (Slurm, Kubernetes).
- Proficiency in Python and C++, with experience writing production-grade libraries, orchestration layers, and automation tools.
- Experience with performance profiling and optimizing large-scale training workflows.
- Ability to work closely with cross-functional teams (ML researchers, infra engineers, product leads) and translate requirements into robust systems.
Ways to stand out
- Experience scaling large GPU training clusters with >1,000 GPUs.
- Contributions to open-source ML systems libraries (e.g., PyTorch, NCCL, FSDP, schedulers, storage clients).
- Expertise in fault resilience and high availability, including elastic training and large-scale observability.
- Hands-on leadership experience as a technical authority for ML systems engineering.
- Familiarity with reinforcement learning at scale, particularly for simulation-heavy workloads.
Compensation & Benefits
- Base salary range: 224,000 USD - 356,500 USD (base determined by location, experience, and pay of employees in similar positions).
- Eligible for equity and benefits (see NVIDIA benefits page).
Other details
- Applications accepted at least until July 3, 2026. This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer and committed to fostering an inclusive work environment.