Required Skills & Competences
Software Development @ 5, Leadership @ 3, Communication @ 3, Networking @ 6, PyTorch @ 3, CUDA @ 1, GPU @ 6
Details
We are seeking a Distinguished Engineer for AI Resiliency at NVIDIA. In this role you will architect, design, and develop world-class software resiliency features for training groundbreaking AI models on the largest AI superclusters in the world. You will lead a team of cross-functional experts to drive and shape an end-to-end AI software stack, ensuring near-zero downtime for training on industry-leading frameworks like PyTorch and JAX/XLA. Your work will span algorithmic innovations and robust software architecture with direct exposure to NVIDIA senior leadership.
Responsibilities
- Define a scalable software architecture to enable single-job resilient training on hundreds of thousands of GPUs with minimal downtime.
- Design and deliver modular, resilient software features to support large-scale AI training for top customers.
- Innovate and evolve resilient architecture designs to meet stringent uptime requirements (downtime < 1%), using approaches such as in-memory checkpointing, in-process restart, and detection of anomalies and silent data corruption (SDC).
- Collaborate closely with internal partners, spearhead project execution, and communicate regular progress updates to senior leadership.
- Lead and mentor cross-functional engineering teams across the AI software stack.
Requirements
- Master's or Ph.D. in Computer Science, Electrical Engineering, or Computer Engineering from a top-tier university, or equivalent experience.
- 15+ years of experience in software architecture or related fields, with deep understanding of AI-optimized systems.
- At least 5 years of hands-on software development on high-complexity projects involving HPC or AI, ideally across the full lifecycle (design to deployment) of large-scale HPC systems.
- Excellent collaboration and communication skills across multiple engineering teams.
Ways to Stand Out
- Proven experience with large-scale AI supercomputing applications, particularly in the training phase.
- 5+ years of experience using and contributing to modern AI frameworks like PyTorch and JAX/XLA for large-scale training workloads.
- Strong passion for, and experience in, designing system architectures tailored for AI (CPU, GPU, memory, storage, networking).
- Experience implementing HPC software development best practices in large-scale systems.
- Experience collaborating with deep learning research teams, CUDA kernel and framework development teams, and/or silicon architecture teams.
Benefits
- Base salary range: 308,000 USD - 471,500 USD (actual base salary determined by location, experience, and pay of employees in similar positions).
- Eligible for equity and company benefits. (See NVIDIA benefits page for details.)
Additional Details
- Applications accepted at least until July 29, 2025.
- NVIDIA is an equal opportunity employer committed to diversity and inclusion.