Used Tools & Technologies
Not specified
Required Skills & Competences
Kubernetes @ 7, Python @ 6, Networking @ 7, Distributed Systems @ 4, Leadership @ 4, Performance Monitoring @ 4, Technical Leadership @ 4, Experimentation @ 4, PyTorch @ 4, GPU @ 4

Details
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today, NVIDIA is applying AI to define the next era of computing — where GPUs act as the brains of computers, robots, and self-driving cars that can understand the world. The team is seeking a Senior Deep Learning Systems Engineer to build and scale training libraries and infrastructure for end-to-end autonomous driving models, enabling training on thousands of GPUs and massive datasets.
Responsibilities
- Craft, scale, and harden deep learning infrastructure libraries and frameworks for training on multi-thousand GPU clusters.
- Improve efficiency across the training stack: data loaders, distributed training, scheduling, and performance monitoring.
- Build robust training pipelines and libraries to handle massive video datasets and enable rapid experimentation.
- Collaborate with researchers, model engineers, and internal platform teams to enhance efficiency, minimize stalls, and improve training availability.
- Own core infrastructure components such as orchestration libraries, distributed training frameworks, and fault-resilient training systems.
- Partner with leadership to ensure infrastructure scales with growing GPU capacity and dataset size while maintaining developer efficiency and stability.
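The fault-resilience themes running through these responsibilities (minimizing stalls, improving training availability, fault-resilient training systems) all rest on one primitive: atomically written, resumable checkpoints. Below is a minimal sketch of that skeleton; it assumes nothing about NVIDIA's actual stack, and the `train` function, JSON format, and `every` parameter are purely illustrative.

```python
import json
import os

def train(total_steps, ckpt_path, every=10):
    """Toy fault-resilient training loop: resume from the newest checkpoint.

    `state` stands in for model/optimizer state; a real system would shard
    and checkpoint tensors, but the resume-plus-atomic-write skeleton is the
    same idea.
    """
    step, state = 0, 0
    if os.path.exists(ckpt_path):  # a prior run left a checkpoint: resume it
        with open(ckpt_path) as f:
            saved = json.load(f)
        step, state = saved["step"], saved["state"]
    while step < total_steps:
        state += step  # stand-in for one optimizer step
        step += 1
        if step % every == 0 or step == total_steps:
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"step": step, "state": state}, f)
            os.replace(tmp, ckpt_path)  # atomic rename: never a torn checkpoint
    return state
```

Writing to a temp file and renaming means a crash mid-write can never corrupt the last good checkpoint, which is what keeps restarts cheap on large clusters.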
Requirements
- BS, MS, or PhD in Computer Science, Electrical/Computer Engineering, or a related field, or equivalent experience.
- 12+ years of professional experience building and scaling high-performance distributed systems, ideally in ML, HPC, or large-scale data infrastructure.
- Extensive knowledge of deep learning frameworks (PyTorch preferred) and large-scale training technologies (DDP/FSDP, NCCL, tensor and pipeline parallelism).
- Experience with performance profiling and optimizing training stacks.
- Strong systems background including datacenter networking (RoCE, InfiniBand), parallel filesystems (Lustre), storage systems, and schedulers (Slurm, Kubernetes).
- Proficiency in Python and C++, with experience writing production-grade libraries, orchestration layers, and automation tools.
- Ability to work closely with cross-functional teams (ML researchers, infra engineers, product leads) and translate requirements into robust systems.
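Several of these requirements center on collective communication (NCCL, DDP/FSDP, datacenter fabrics). For illustration only, here is a pure-Python sketch of the ring all-reduce pattern that such libraries implement in hardware-optimized form; the function name and list-of-lists representation are assumptions of this sketch, not any real API.

```python
def ring_allreduce(buffers):
    """Sum-all-reduce over n equal-length worker buffers via the ring algorithm.

    Pure-Python simulation: each "worker" is a list, each "send" a slice copy.
    Real collectives run the same two phases over NVLink or InfiniBand links.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer length must divide evenly into n chunks"
    chunk = size // n
    # Split every worker's buffer into n chunks (working copies).
    data = [[list(buf[c * chunk:(c + 1) * chunk]) for c in range(n)]
            for buf in buffers]

    # Phase 1 (reduce-scatter): in step t, worker i sends chunk (i - t) mod n
    # to its right neighbor, which accumulates it. After n - 1 steps, worker i
    # holds the fully reduced chunk (i + 1) mod n.
    for t in range(n - 1):
        for i in range(n):
            c = (i - t) % n
            dst = (i + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], data[i][c])]

    # Phase 2 (all-gather): circulate the reduced chunks the same way, but
    # overwrite instead of accumulate, until every worker holds every chunk.
    for t in range(n - 1):
        for i in range(n):
            c = (i + 1 - t) % n
            data[(i + 1) % n][c] = list(data[i][c])

    return [[v for ch in worker for v in ch] for worker in data]
```

Each worker transmits roughly 2 * (n - 1) / n of the buffer regardless of ring size, which is why this pattern remains bandwidth-optimal on multi-thousand GPU clusters.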
Ways to stand out
- Experience scaling large GPU training clusters with >1,000 GPUs.
- Contributions to open-source ML systems libraries (e.g., PyTorch, NCCL, FSDP, schedulers, storage clients).
- Expertise in fault resilience and high availability, including elastic training and large-scale observability.
- Hands-on technical leadership experience, establishing guidelines for ML systems engineering.
- Familiarity with reinforcement learning at scale, especially for simulation-heavy workloads.
Compensation & Benefits
- Base salary range: 224,000 USD - 356,500 USD (determined based on location, experience, and internal pay equity).
- Eligible for equity and additional benefits (see NVIDIA benefits).
Other details
- Location: Santa Clara, CA, United States.
- Employment type: Full time.
- Applications accepted at least until October 6, 2025.
- NVIDIA is an equal opportunity employer and fosters a diverse work environment.
#deeplearning