Senior Software Engineer, AI Resiliency

at NVIDIA
USD 184,000-287,500 per year
SENIOR
✅ On-site

Required Skills & Competences

Python @ 4, CI/CD @ 4, Distributed Systems @ 4, TensorFlow @ 3, Parallel Programming @ 7, Debugging @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4, AI @ 4, Profiling @ 4, NCCL @ 4, HPC @ 4, JAX @ 4

Details

You will join the AI Software Resiliency team working to define and implement resiliency features for AI supercomputers at extreme scale (100,000+ GPUs). The role focuses on reducing cluster downtime toward zero by building robust checkpoint/recovery, failure detection/isolation, and mitigation mechanisms for large-scale AI training and inference workloads.

Responsibilities

  • Implement and optimize AI software resiliency features such as fast checkpoint-recovery, error detection and isolation, and straggler/hang detection.
  • Contribute high-quality, production-level C++ and Python code for large-scale distributed systems and optimize performance for workloads running on thousands of GPUs.
  • Design and implement fault tolerance, detection of silent data corruption (SDC), and other failure handling techniques; develop monitoring tools for proactive failure mitigation.
  • Collaborate with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks (PyTorch, JAX/XLA) and broader stack components.
  • Develop tests and automation to ensure robustness, scalability, and efficiency of resiliency mechanisms; contribute to CI/CD pipelines for automated validation of AI workloads.
  • Support production deployments by debugging and performance tuning large-scale AI workloads in cloud and HPC environments.

Requirements

  • Bachelor's, Master's, or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
  • 6+ years of relevant industry experience.
  • Proficiency in C++ and Python and experience writing efficient, high-performance production code.
  • Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments.
  • Familiarity with AI frameworks (PyTorch, JAX/XLA, TensorFlow) and with integrating resiliency features into them.
  • Experience with debugging and profiling tools (examples: gdb, perf, valgrind, NVIDIA Nsight).
  • Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment.

Ways to Stand Out

  • Hands-on experience training models or working with model training teams.
  • Experience with CUDA, NCCL, or MPI for GPU-accelerated computing, particularly at extreme scale.
  • Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training.
  • Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads.
  • Strong systems programming skills and experience with low-level performance tuning.

Compensation & Benefits

  • Base salary range: 184,000 USD - 287,500 USD (final base salary determined by location, experience, and comparable internal pay).
  • Eligible for equity and company benefits (link to NVIDIA benefits provided in posting).

Other Information

  • Applications accepted at least until May 16, 2026.
  • This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer committed to diversity.