Senior Software Engineer, AI Resiliency

at NVIDIA

📍 Redmond, United States

USD 184,000-287,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;

7-9: expert. You can teach others and you know all the pitfalls and tricks;

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Python @ 4, CI/CD @ 4, Distributed Systems @ 4, TensorFlow @ 3, Parallel Programming @ 7, Debugging @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4, AI @ 4, Profiling @ 4, NCCL @ 4, HPC @ 4, JAX @ 4

Details

You will join the AI Software Resiliency team working to define and implement resiliency features for AI supercomputers at extreme scale (100,000+ GPUs). The role focuses on reducing cluster downtime toward zero by building robust checkpoint/recovery, failure detection/isolation, and mitigation mechanisms for large-scale AI training and inference workloads.

Responsibilities

Implement and optimize AI software resiliency features such as fast checkpoint-recovery, error detection and isolation, and straggler/hang detection.
Contribute high-quality, production-level C++ and Python code for large-scale distributed systems and optimize performance for workloads running on thousands of GPUs.
Design and implement fault tolerance, detection of silent data corruption (SDC), and other failure handling techniques; develop monitoring tools for proactive failure mitigation.
Collaborate with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks (PyTorch, JAX/XLA) and broader stack components.
Develop tests and automation to ensure robustness, scalability, and efficiency of resiliency mechanisms; contribute to CI/CD pipelines for automated validation of AI workloads.
Support production deployments by debugging and performance tuning large-scale AI workloads in cloud and HPC environments.

Requirements

Bachelor's, Master's, or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
6+ years of relevant industry experience.
Proficiency in C++ and Python and experience writing efficient, high-performance production code.
Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments.
Familiarity with AI frameworks (PyTorch, JAX/XLA, TensorFlow) and with integrating resiliency features into them.
Experience with debugging and profiling tools (examples: gdb, perf, valgrind, NVIDIA Nsight).
Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment.

Ways to Stand Out

Hands-on experience training models or working with model training teams.
Experience with CUDA, NCCL, or MPI for GPU-accelerated computing, particularly at extreme scale.
Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training.
Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads.
Strong systems programming skills and experience with low-level performance tuning.

Compensation & Benefits

Base salary range: 184,000 USD - 287,500 USD (final base salary determined by location, experience, and comparable internal pay).
Eligible for equity and company benefits (link to NVIDIA benefits provided in posting).

Other Information

Applications accepted at least until May 16, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer committed to diversity.