Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and know all the pitfalls and tricks;
- 10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 4
Python @ 4
Networking @ 4
Debugging @ 6
System Architecture @ 4
LLM @ 4
CUDA @ 6
GPU @ 4
Deep Learning @ 4
AI @ 4
vLLM @ 4
Slurm @ 4
TensorRT @ 4
Details
NVIDIA is seeking a Senior Validation Engineer on the DGX Server Product Engineering Team to work with HW/SW engineers to develop and implement complex automated test plans for GPU-accelerated computing products. The role focuses on system architecture, performance modeling, GPU SKU bring up, validation, and using industry-leading Deep Learning/AI applications for system-level stress and performance testing. The position requires on-site work in a hardware lab environment 5 days a week in Santa Clara, CA.
Responsibilities
- System architecture, design, and performance modeling and estimation across new models and new packages.
- Enable GPU SKU bring up, validation and model enablement.
- Develop system-level stress and performance testing strategies using industry-leading Deep Learning/AI applications.
- Work with HW/SW engineers to develop and implement complex automated test plans for GPU-accelerated computing products.
Requirements
- Ability to work on site in a hardware lab environment 5 days a week (Santa Clara, CA).
- BSEE or BSCE or equivalent experience.
- 5+ years of experience validating and debugging complex systems.
- Experience developing/running real-world ML/LLM workloads.
- Mandatory skills: Dynamo, TensorRT, Slurm, BCM.
- Preferred: Knowledge of vLLM and SGLang.
- Proficiency in CUDA, cuBLAS and CUTLASS.
- Deep understanding of computing architectures.
- Coding experience with Python and experience running simulators.
- Experience with datacenter products including system management, security, networking, and storage.
Ways to stand out
- Background with x86/Arm server architectures and accelerated GPU computing.
- Track record of continuous process improvement with a passion for tools and automation.
Compensation & Benefits
- Base salary range (determined by location, experience, and comparable employees):
  - Level 3: 136,000 USD - 212,750 USD
  - Level 4: 168,000 USD - 258,750 USD
- Eligible for equity and benefits. See www.nvidiabenefits.com for details.
Other
- Applications accepted at least until April 5, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.