Required Skills & Competences
Software Development @ 4, Ansible @ 4, CentOS @ 7, Docker @ 1, Jenkins @ 4, Kubernetes @ 1, Linux @ 7, DevOps @ 4, Python @ 4, Java @ 4, GitHub @ 1, CI/CD @ 4, TensorFlow @ 4, JavaScript @ 4, Networking @ 4, Parallel Programming @ 4, Debugging @ 4, NLP @ 7, LLM @ 4, PyTorch @ 4, Agile @ 4, CUDA @ 4, GPU @ 4
Details
NVIDIA is the world leader in GPU computing across markets including gaming, automotive, vision, HPC, data centers, and networking. NVIDIA GPUs power deep learning frameworks, analytics, data centers, and autonomous vehicle systems. The company seeks an experienced engineer with strong enterprise server integration skills, Linux expertise, reliability testing experience, scale-out cluster knowledge, test plan development experience, and a background in AI tools and NLP. The role sits on the platform SWQA team and requires experience with DevOps and CI/CD.
Responsibilities
- Develop and execute NVIDIA HGX/DGX/MGX platform test plans for servers, OS, firmware, and CUDA software stack based on design documentation.
- Install and test various system operating systems, server firmware, and software stacks.
- Drive root cause analysis for reliability and validation test failures and define mitigations.
- Build, develop, and debug server- and OS-level automation frameworks (front-end and back-end) and tests.
- Review partner and supplier test results and recommend additional reliability testing for components, servers, and packaging as needed.
- Work in an agile software development team with high production-quality standards.
- Manage bug lifecycle and collaborate across groups to drive solutions.
Requirements
- Bachelor’s degree in a STEM field (or equivalent experience); a master’s degree or 5+ years of proven experience is acceptable.
- Proven experience with OS- and server-level automation, CI/CD processes, and DevOps, using technologies such as Python, shell scripting, Ansible, Jenkins, C/C++, Java, and JavaScript.
- Strong server and Linux (Ubuntu, Red Hat, CentOS, SUSE, Fedora, etc.) troubleshooting and debugging experience in bare-metal and virtualized environments (KVM / VMware / Hyper-V).
- Experience with reliability testing, telemetry, scale-out clusters, and test plan development.
- Knowledge of and hands-on experience with model testing and AI frameworks/tools (TensorFlow, PyTorch, Cursor, etc.), plus NLP and LLM benchmarking.
- Experience using AI development tools for test plan creation, test case development, and test case automation.
- Strong experience with firmware (FW), BMC/OpenBMC, network protocols, enterprise storage devices, PCIe buses/devices, IO sub-devices, CPU and memory, and the ACPI and UEFI specifications; Redfish experience is a big plus.
- Experience with version control and code review tools (GitHub / GitLab / Gerrit), PXE, SLURM, and container/orchestration technologies (Docker, Kubernetes) is a strong plus.
Ways to stand out
- Prior experience with AI-related tools, LLMs, and NLP.
- Experience working with NVIDIA GPU hardware.
- Solid understanding of virtualization in Linux (KVM) and container orchestration (Docker + Kubernetes).
- Background in parallel programming, ideally CUDA/OpenCL.
Compensation & Benefits
- Base salary ranges by level:
  - Level 3: 136,000 USD - 212,750 USD per year
  - Level 4: 168,000 USD - 264,500 USD per year
- You will also be eligible for equity and benefits.
Additional info
- Applications will be accepted at least until July 29, 2025.
- NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.