Used Tools & Technologies
Not specified
Required Skills & Competences
Ansible @ 4 CentOS @ 7 Docker @ 1 Jenkins @ 4 Kubernetes @ 1 Linux @ 7 DevOps @ 4 Python @ 4 Java @ 4 GitHub @ 1 CI/CD @ 4 TensorFlow @ 4 JavaScript @ 4 Networking @ 4 Parallel Programming @ 4 Debugging @ 7 NLP @ 4 LLM @ 4 PyTorch @ 4 Agile @ 4 CUDA @ 4 GPU @ 4
Details
NVIDIA is the world leader in GPU computing, passionate about markets including gaming, automotive, vision, HPC, datacenters, and networking. Positioned as the 'AI Computing Company,' NVIDIA builds GPUs that power deep learning software frameworks, analytics, data centers, and autonomous vehicles. This role involves working in a diverse environment with a focus on continuous process improvement.
Responsibilities
- Develop and execute NVIDIA HGX/DGX/MGX platform test plans covering servers, OS, firmware, and the CUDA software stack, based on design documents.
- Install and test various system OS, server firmware, and software stacks.
- Drive root cause analysis on reliability and validation test failures to identify and mitigate issues.
- Build, develop, and debug server and OS level automation frameworks and tests.
- Review partner and supplier test results and prescribe additional reliability testing as needed.
- Work in an agile team maintaining high production quality standards.
- Manage bug lifecycle and collaborate with different groups to find solutions.
Requirements
- Bachelor's degree or equivalent in a STEM field; 5+ years of experience or a master's degree.
- Proven experience in OS and server-level automation, CI/CD processes, and DevOps using Python, Shell, Ansible, Jenkins, C/C++, Java, and JavaScript.
- Strong troubleshooting and debugging skills with Linux (Ubuntu, RedHat, CentOS, SuSE, Fedora, etc.) in bare-metal and virtualized environments (KVM, VMware, Hyper-V).
- Hands-on experience with model testing, AI tools/frameworks (TensorFlow, PyTorch, Cursor), NLP, and LLM benchmarking.
- Experience using AI development tools for test plan creation and test case automation.
- Knowledge of firmware, BMC/OpenBMC, network protocols, enterprise storage, PCIe, IO sub-devices, CPU, memory, ACPI, UEFI specs, and Redfish is a plus.
- Experience with GitHub/GitLab/Gerrit, PXE, SLURM, Kubernetes, and Docker is a significant advantage.
Ways to Stand Out
- Experience with AI tools, LLMs, and NLP.
- Experience working with NVIDIA GPU hardware.
- Understanding of Linux virtualization and containers (KVM, Docker, Kubernetes).
- Background in parallel programming such as CUDA/OpenCL.
Benefits
- Competitive salary ranging from 136,000 USD to 264,500 USD, based on location and experience.
- Eligibility for equity and additional benefits.
- Commitment to diversity and equal opportunity employment.