Senior System Software Engineer - Data Center GPU Compute Diagnostics

at NVIDIA
USD 224,000-356,500 per year
SENIOR
✅ On-site

Required Skills & Competences

Linux @ 4, Python @ 7, Mentoring @ 4, Networking @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4, AI @ 4, InfiniBand @ 4, NVLink @ 4

Details

We are seeking a senior system software engineer to work on next-generation data center GPU diagnostics for rack-scale AI supercomputer systems. The team builds applications and compute workloads that heavily stress GPU compute engines, HBM memory, cache hierarchy, PCIe/NVLink interfaces, power delivery, and thermal behavior for silicon/system bring-up, manufacturing, and customer use.

Responsibilities

  • Work closely with hardware architecture, driver, manufacturing, and field teams through the product development lifecycle of rack-scale AI systems.
  • Craft CUDA/C++ diagnostic workloads and software infrastructure required for new chip development, validation, productization, and field triage.
  • Design and implement GPU compute tests that stress Tensor Cores, SMs, L2/cache hierarchy, HBM memory, and related power/thermal operating points.
  • Develop and tune GEMM-style diagnostic workloads, including tests that run alongside additional load on the NVLink, PCIe, or CPU subsystems.
  • Develop and integrate higher-level AI workload tests, including PyTorch-based large-model workloads, to stress GPUs, memory, interconnects, thermal behavior, and system software under realistic rack-scale AI use cases.
  • Assess new hardware features and architect manufacturing and field diagnostic tests using pre-beta GPU drivers, low-level diagnostic software, and system telemetry.
  • Debug failures involving ECC, HBM behavior, thermal limits, voltage/frequency margining, and PCIe/NVLink errors.

Requirements

  • BS or MS degree in Electrical Engineering, Computer Engineering, Computer Science, or equivalent experience.
  • 12+ years of system software, GPU software, embedded software, or hardware validation experience.
  • Experience driving technical work across multiple engineers, mentoring others, or leading development of a complex software component.
  • Experience writing diagnostics and stress tests that interface with low-level hardware drivers and hardware registers.
  • Strong C/C++ and Python programming skills.
  • Experience with Linux device drivers, CUDA kernels, GPU compute workloads, or related accelerator programming is strongly preferred.
  • Understanding of memory systems, ECC behavior, cache hierarchy, bandwidth bottlenecks, and hardware failure signatures.
  • Understanding of GEMM-style workloads and how workload shape, precision, runtime, and verification affect compute stress, power, memory, and thermal behavior.
  • Experience with voltage/frequency characterization, thermal testing, power stress, or related silicon validation concepts such as Vmin/Fmax and P-state testing.
  • Background with PCIe, NVLink, or networking technologies such as InfiniBand and Ethernet.

Compensation

  • Base salary range: 224,000 USD - 356,500 USD (determined based on location, experience, and pay of employees in similar positions).
  • You will also be eligible for equity and benefits.

Additional information

  • Applications accepted at least until May 22, 2026.
  • This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.