Distinguished Engineer, AI Resiliency

at NVIDIA

📍 Santa Clara, United States

USD 308,000-471,500 per year

MIDDLE

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Software Development @ 5
Leadership @ 3
Communication @ 3
Networking @ 6
PyTorch @ 3
CUDA @ 1
GPU @ 6

Details

We are seeking a Distinguished Engineer for AI Resiliency at NVIDIA. In this role you will architect, design, and develop world-class software resiliency features for training groundbreaking AI models on the largest AI superclusters in the world. You will lead a team of cross-functional experts to drive and shape an end-to-end AI software stack, ensuring near-zero downtime for training on industry-leading frameworks like PyTorch and JAX/XLA. Your work will span algorithmic innovations and robust software architecture with direct exposure to NVIDIA senior leadership.

Responsibilities

Define a scalable software architecture to enable single-job resilient training on hundreds of thousands of GPUs with minimal downtime.
Design and deliver modular, resilient software features to support large-scale AI training for top customers.
Innovate and evolve resilient architecture designs to meet stringent uptime requirements (downtime < 1%), using approaches such as in-memory checkpointing, in-process restart, and detection of anomalies and silent data corruption (SDC).
Collaborate closely with internal partners, spearhead project execution, and communicate regular progress updates to senior leadership.
Lead and mentor cross-functional engineering teams across the AI software stack.

Requirements

Master's or Ph.D. in Computer Science, Electrical Engineering, or Computer Engineering from a top-tier university, or equivalent experience.
15+ years of experience in software architecture or related fields, with deep understanding of AI-optimized systems.
At least 5 years of hands-on software development on high-complexity projects involving HPC or AI, ideally across the full lifecycle (design to deployment) of large-scale HPC systems.
Excellent collaboration and communication skills across multiple engineering teams.

Ways to Stand Out

Proven experience with large-scale AI supercomputing applications, particularly in the training phase.
5+ years of experience using and contributing to modern AI frameworks like PyTorch and JAX/XLA for large-scale training workloads.
Strong passion for, and experience in, designing system architectures tailored for AI (CPU, GPU, memory, storage, networking).
Experience implementing HPC software development best practices in large-scale systems.
Experience collaborating with deep learning research teams, CUDA kernel and framework development teams, and/or silicon architecture teams is a plus.

Benefits

Base salary range: 308,000 USD - 471,500 USD (actual base salary determined by location, experience, and pay of employees in similar positions).
Eligible for equity and company benefits. (See NVIDIA benefits page for details.)

Additional Details

Applications accepted at least until July 29, 2025.
NVIDIA is an equal opportunity employer committed to diversity and inclusion.