Senior Software QA Test Development Engineer - Diagnostics

at NVIDIA
USD 140,000-270,250 per year
SENIOR
✅ On-site

Required Skills & Competences

Software Development @ 4 Ansible @ 4 CentOS @ 7 Docker @ 1 Jenkins @ 4 Kubernetes @ 1 Linux @ 7 DevOps @ 4 Python @ 4 Java @ 4 GitHub @ 1 CI/CD @ 7 TensorFlow @ 4 JavaScript @ 4 Parallel Programming @ 4 Debugging @ 4 QA @ 4 NLP @ 4 LLM @ 4 PyTorch @ 4 Agile @ 4 CUDA @ 4 GPU @ 4

Details

NVIDIA is the world leader in GPU computing and positions itself as the 'AI Computing Company.' This role is for a Senior Software QA Test Development Engineer on the platform SWQA team focusing on HGX/DGX/MGX platform diagnostics and reliability testing across servers, OS, firmware and the CUDA software stack. The role requires strong Linux and server-level troubleshooting, experience with automation and CI/CD, and hands-on knowledge of AI frameworks and model testing.

Responsibilities

  • Develop and execute NVIDIA HGX/DGX/MGX platform test plans for servers, OS, firmware and CUDA software stack based on design documents.
  • Install and test various system OS versions, server firmware and software stacks.
  • Drive root cause analysis for reliability and validation test failures and implement mitigations.
  • Build, develop, and debug front-end and back-end automation frameworks and tests at the server and OS level.
  • Review partner and supplier test results and recommend additional reliability testing for components, servers, and packaging as needed.
  • Work within an agile software development team with high production quality standards.
  • Manage bug lifecycle and collaborate cross-functionally to drive solutions.

Requirements

  • Bachelor’s degree (or equivalent experience) in a STEM field (Science, Technology, Engineering, Math or Physics).
  • 5+ years of proven experience (or a Master’s degree).
  • Proven experience in OS and server-level automation, CI/CD processes and DevOps using Python, shell scripting, Ansible, Jenkins, C/C++, Java, and JavaScript.
  • Strong server and Linux (Ubuntu, Red Hat, CentOS, SUSE, Fedora, etc.) troubleshooting and debugging experience in bare-metal and KVM/VMware/Hyper-V environments.
  • Good knowledge of and hands-on experience with model testing, AI tools and frameworks (TensorFlow, PyTorch, Cursor, etc.), and NLP and LLM benchmarking.
  • Experience using AI development tools for test plan creation, test case development and test case automation.
  • Strong experience with firmware (FW), BMC/OpenBMC, network protocols, internal/external enterprise storage devices, PCIe buses and devices, IO sub-devices, CPU and memory, ACPI, UEFI spec, and Redfish is a huge plus.
  • Proven experience with GitHub/GitLab/Gerrit, PXE, SLURM, and container/orchestration tooling (Kubernetes/Docker) is a huge plus.

Ways to stand out

  • Experience with AI-related tools, LLMs and NLP.
  • Experience working with NVIDIA GPU hardware.
  • Solid understanding of virtualization in Linux (KVM) and container orchestration (Docker with Kubernetes).
  • Background in parallel programming, ideally CUDA/OpenCL.

Compensation & Benefits

  • Base salary ranges by level: Level 3: 140,000 USD - 224,250 USD; Level 4: 168,000 USD - 270,250 USD. Your base salary will be determined based on location, experience, and pay of employees in similar positions.
  • Eligible for equity and company benefits (see NVIDIA benefits).

Additional information

  • Applications accepted at least until January 14, 2026.
  • This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer and values diversity; it does not discriminate on legally protected bases.