Distinguished Engineer, AI Resiliency

at NVIDIA
USD 308,000-471,500 per year
MIDDLE
On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Software Development @ 5, Leadership @ 3, Communication @ 3, Networking @ 6, PyTorch @ 3, CUDA @ 1, GPU @ 6

Details

We are seeking a Distinguished Engineer for AI Resiliency at NVIDIA. In this role, you will architect, design, and develop world-class software resiliency features for training groundbreaking AI models on the largest AI superclusters in the world. You will lead a team of cross-functional experts to drive and shape an end-to-end AI software stack, ensuring near-zero downtime for training on industry-leading frameworks like PyTorch and JAX/XLA. Your work will span algorithmic innovations and robust software architecture, with direct exposure to NVIDIA senior leadership.

Responsibilities

  • Define a scalable software architecture to enable single-job resilient training on hundreds of thousands of GPUs with minimal downtime.
  • Design and deliver modular, resilient software features to support large-scale AI training for top customers.
  • Innovate and evolve resilient architecture designs to meet stringent uptime requirements (downtime < 1%), using approaches such as in-memory checkpointing, in-process restart, and detection of anomalies and silent data corruption (SDC).
  • Collaborate closely with internal partners, spearhead project execution, and communicate regular progress updates to senior leadership.
  • Lead and mentor cross-functional engineering teams across the AI software stack.

Requirements

  • Master's or Ph.D. in Computer Science, Electrical Engineering, or Computer Engineering from a top-tier university, or equivalent experience.
  • 15+ years of experience in software architecture or related fields, with deep understanding of AI-optimized systems.
  • At least 5 years of hands-on software development on high-complexity projects involving HPC or AI, ideally across the full lifecycle (design to deployment) of large-scale HPC systems.
  • Excellent collaboration and communication skills across multiple engineering teams.

Ways to Stand Out

  • Proven experience with large-scale AI supercomputing applications, particularly in the training phase.
  • 5+ years of experience using and contributing to modern AI frameworks like PyTorch and JAX/XLA for large-scale training workloads.
  • Strong passion for, and experience in, designing system architectures tailored for AI (CPU, GPU, memory, storage, networking).
  • Experience implementing HPC software development best practices in large-scale systems.
  • Experience collaborating with deep learning research teams, CUDA kernel and framework development teams, and/or silicon architecture teams.

Benefits

  • Base salary range: 308,000 USD - 471,500 USD (actual base salary determined by location, experience, and pay of employees in similar positions).
  • Eligible for equity and company benefits. (See NVIDIA benefits page for details.)

Additional Details

  • Applications accepted at least until July 29, 2025.
  • NVIDIA is an equal opportunity employer committed to diversity and inclusion.