Senior System Software Engineer, Data Center GPU Compute Diagnostics

at NVIDIA

📍 Durham, United States

USD 224,000-356,500 per year

SENIOR

✅ On-site

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels are defined as follows:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;

7-9: expert. You can teach others, and you know all the pitfalls and tricks;

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Linux @ 4, Python @ 7, Mentoring @ 4, Networking @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4, AI @ 4, InfiniBand @ 4, NVLink @ 4

Details

We are seeking a senior system software engineer to work on next-generation data center GPU diagnostics for rack-scale AI supercomputer systems. The team builds applications and compute workloads that heavily stress GPU compute engines, HBM memory, cache hierarchy, PCIe/NVLink interfaces, power delivery, and thermal behavior for silicon/system bring-up, manufacturing, and customer use.

Responsibilities

Work closely with hardware architecture, driver, manufacturing, and field teams through the product development lifecycle of rack-scale AI systems.
Craft CUDA/C++ diagnostic workloads and software infrastructure required for new chip development, validation, productization, and field triage.
Design and implement GPU compute tests that stress Tensor Cores, SMs, L2/cache hierarchy, HBM memory, and related power/thermal operating points.
Develop and tune GEMM-style diagnostic workloads, including tests combined with additional load on NVLink, PCIe, or CPU subsystems.
Develop and integrate higher-level AI workload tests, including PyTorch-based large model workloads to stress GPUs, memory, interconnects, thermal behavior, and system software under realistic rack-scale AI use cases.
Assess new hardware features and architect manufacturing and field diagnostic tests using pre-beta GPU drivers, low-level diagnostic software, and system telemetry.
Debug failures involving ECC, HBM behavior, thermal limits, voltage/frequency margining, and PCIe/NVLink errors.

Requirements

BS or MS degree in Electrical Engineering, Computer Engineering, Computer Science, or equivalent experience.
12+ years of system software, GPU software, embedded software, or hardware validation experience.
Experience driving technical work across multiple engineers, mentoring others, or leading development of a complex software component.
Experience writing diagnostics and stress tests that interface with low-level hardware drivers and hardware registers.
Strong C/C++ and Python programming skills.
Experience with Linux device drivers, CUDA kernels, GPU compute workloads, or related accelerator programming is strongly preferred.
Understanding of memory systems, ECC behavior, cache hierarchy, bandwidth bottlenecks, and hardware failure signatures.
Understanding of GEMM-style workloads and how workload shape, precision, runtime, and verification affect compute stress, power, memory, and thermal behavior.
Experience with voltage/frequency characterization, thermal testing, power stress, or related silicon validation concepts such as Vmin/Fmax and P-state testing.
Background with PCIe, NVLink, or networking technologies such as InfiniBand and Ethernet.

Compensation

Base salary range: 224,000 USD - 356,500 USD (determined based on location, experience, and pay of employees in similar positions).
You will also be eligible for equity and benefits.

Additional information

Applications are accepted until at least May 22, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.