Senior System Software Engineer, Data Center GPU Compute Diagnostics
at NVIDIA
USD 224,000-356,500 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Linux @ 4
Python @ 7
Mentoring @ 4
Networking @ 4
PyTorch @ 4
CUDA @ 4
GPU @ 4
AI @ 4
InfiniBand @ 4
NVLink @ 4
Details
We are seeking a senior system software engineer to work on next-generation data center GPU diagnostics for rack-scale AI supercomputer systems. The team builds applications and compute workloads that heavily stress GPU compute engines, HBM memory, cache hierarchy, PCIe/NVLink interfaces, power delivery, and thermal behavior for silicon/system bring-up, manufacturing, and customer use.
Responsibilities
- Work closely with hardware architecture, driver, manufacturing, and field teams through the product development lifecycle of rack-scale AI systems.
- Craft CUDA/C++ diagnostic workloads and software infrastructure required for new chip development, validation, productization, and field triage.
- Design and implement GPU compute tests that stress Tensor Cores, SMs, L2/cache hierarchy, HBM memory, and related power/thermal operating points.
- Develop and tune GEMM-style diagnostic workloads, including tests that run alongside additional load on the NVLink, PCIe, or CPU subsystems.
- Develop and integrate higher-level AI workload tests, including PyTorch-based large model workloads to stress GPUs, memory, interconnects, thermal behavior, and system software under realistic rack-scale AI use cases.
- Assess new hardware features and architect manufacturing and field diagnostic tests using pre-beta GPU drivers, low-level diagnostic software, and system telemetry.
- Debug failures involving ECC, HBM behavior, thermal limits, voltage/frequency margining, and PCIe/NVLink errors.
Requirements
- BS or MS degree in Electrical Engineering, Computer Engineering, Computer Science, or equivalent experience.
- 12+ years of system software, GPU software, embedded software, or hardware validation experience.
- Experience driving technical work across multiple engineers, mentoring others, or leading development of a complex software component.
- Experience writing diagnostics and stress tests that interface with low-level hardware drivers and hardware registers.
- Strong C/C++ and Python programming skills.
- Experience with Linux device drivers, CUDA kernels, GPU compute workloads, or related accelerator programming is strongly preferred.
- Understanding of memory systems, ECC behavior, cache hierarchy, bandwidth bottlenecks, and hardware failure signatures.
- Understanding of GEMM-style workloads and how workload shape, precision, runtime, and verification affect compute stress, power, memory, and thermal behavior.
- Experience with voltage/frequency characterization, thermal testing, power stress, or related silicon validation concepts such as Vmin/Fmax and P-state testing.
- Background with PCIe, NVLink, or networking technologies such as InfiniBand and Ethernet.
Compensation
- Base salary range: 224,000 USD - 356,500 USD (determined based on location, experience, and pay of employees in similar positions).
- You will also be eligible for equity and benefits.
Additional information
- Applications accepted at least until May 22, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.