Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and you know all the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Software Development @ 4
Ansible @ 4
CentOS @ 7
Docker @ 3
Jenkins @ 4
Kubernetes @ 3
Linux @ 7
DevOps @ 7
Python @ 4
Java @ 4
GitHub @ 3
CI/CD @ 4
TensorFlow @ 4
JavaScript @ 4
Parallel Programming @ 4
Debugging @ 4
NLP @ 7
LLM @ 4
PyTorch @ 4
Agile @ 4
CUDA @ 4
GPU @ 4
AI @ 4
OpenCL @ 4
Slurm @ 3
Details
NVIDIA is the world leader in GPU computing and positions itself as an AI computing company. The platform SWQA team is looking for a candidate with experience in enterprise server integration, Linux, reliability testing with telemetry, scale-out clusters, test plan development, AI tools and NLP, and DevOps/CI-CD.
Responsibilities
- Develop and execute test plans for NVIDIA HGX/DGX/MGX platforms, covering servers, OS, firmware and the CUDA software stack, based on design documentation.
- Install and test various system OS, server firmware and software stacks.
- Drive root cause analysis for reliability and validation test failures and implement mitigations.
- Build, develop and debug front-end and back-end automation frameworks and tests at the server and OS level.
- Review partner and supplier test results and prescribe additional reliability testing on components, servers, and packaging as needed.
- Work in an agile software development team with high production quality standards.
- Manage the bug lifecycle and collaborate across groups to drive solutions.
Requirements
- Bachelor’s degree (or equivalent experience) in a STEM field. Master’s degree or 5+ years of proven experience preferred.
- Proven experience in OS and server-level automation, CI/CD processes and DevOps using technologies such as Python, Shell, Ansible, Jenkins, C/C++, Java, and JavaScript.
- Strong server and Linux (Ubuntu, Red Hat, CentOS, SUSE, Fedora, etc.) troubleshooting and debugging experience in bare-metal and KVM/VMware/Hyper-V environments.
- Hands-on experience in model testing and AI frameworks/tools (TensorFlow, PyTorch, Cursor, etc.), plus NLP and LLM benchmarking.
- Experience using AI development tools for test plan creation, test case development and test case automation.
- Experience with firmware (FW), BMC/OpenBMC, network protocols, enterprise storage devices, PCIe buses/devices, IO sub-devices, CPU and memory, ACPI, UEFI spec, and Redfish is a strong plus.
- Familiarity with GitHub/GitLab/Gerrit, PXE, Slurm, Kubernetes, Docker, and container/orchestration tooling is a plus.
Ways to Stand Out
- Experience with AI-related tools, LLMs and NLP.
- Experience working with NVIDIA GPU hardware is a strong plus.
- Solid understanding of virtualization and containers in Linux (KVM, Docker orchestrated with Kubernetes).
- Background in parallel programming, ideally CUDA/OpenCL.
Compensation
- Base salary ranges by level:
  - Level 3: 140,000 USD - 224,250 USD
  - Level 4: 168,000 USD - 270,250 USD
Benefits
- Eligible for equity and company benefits (see https://www.nvidia.com/en-us/benefits/).
Additional Information
- Applications accepted at least until February 28, 2026.
- This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes. NVIDIA is an equal opportunity employer and values diversity.