Used Tools & Technologies
Not specified
Required Skills & Competences
Software Development, Ansible, Chef, Docker, Kubernetes, Linux, MySQL, DevOps, IaC, Terraform, Python, SQL, GCP, Java, CI/CD, AWS, Azure, Git, Android, Debugging, Puppet, CUDA, GPU
Details
NVIDIA is looking for a Senior DevOps Engineer to work in IPP (Infrastructure, Planning and Process) Sanity Engineering and execute on NVIDIA product bringups. IPP is a core software infrastructure organization within NVIDIA. The group works with other groups within NVIDIA Software, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars, to cater to their infrastructure needs. IPP's cloud services run almost half a million automated jobs per day on thousands of distributed datacenters, supporting the productivity of thousands of NVIDIA's software engineers worldwide.
The cloud hosts a heterogeneous mix of machines and devices with various operating systems (Windows/Linux/Android) and a multitude of hardware platforms, including both NVIDIA GPUs and Tegra processors.
Responsibilities
- Lead end-to-end infrastructure bringup of new NVIDIA GPU products.
- Have a thorough understanding of NVIDIA GPU hardware, the display driver stack, SBIOS, and VBIOS, and enhance automation for farm-wide updates.
- Solve complex problems on groundbreaking pre-release products: leading GPU product bringups (PCIe & Enterprise) in infrastructure, integrating GPU test suites into the infrastructure harness, and scaling multi-site distributed infrastructure.
- Optimize farm utilization of GPU resources by identifying the right set of regression test coverage.
- Champion config automation using world-class configuration management and infrastructure automation (IaC) tools such as Chef, Puppet, Ansible, and Terraform.
- Execute on bringup of specialized products used for accelerated computing and AI in fast-paced and critically important environments.
- Lead the service charter and act as the responsible pilot for development, telemetry, and automation of bringup infrastructure.
- Automate and tune performance of regression test frameworks; create self-healing/automated recovery solutions for multi-geo regression farms.
- Engage with collaborators and partner teams to deliver onboarding of new products in CI/CD.
- Implement multiple parallel bringups across the NVIDIA product bringup landscape.
Requirements
- Bachelor’s or Master’s Degree in Computer Science or Software Engineering, or equivalent experience.
- 10+ years of relevant experience.
- Hands-on coding and debugging, including cross-compiling source code on various platforms; ability to triage, root-cause, and resolve issues in bringup infrastructure.
- Familiarity with the setup and maintenance of Linux and Windows (x64 and ARM) hosts, VMs, and container environments.
- Programming experience in Python (preferred), Java, etc.
- Proficiency in Unix and Tcl shell scripting.
- Experience with MySQL/NoSQL databases and the ability to write complex queries.
- Experience with version control systems such as Perforce and Git.
- 7+ years of development and operations experience in large-scale enterprise production systems.
Ways to stand out
- Experience automating bare metal and VM provisioning.
- Prior knowledge of VM isolation for GPUs and NVIDIA Confidential Computing.
- Experience with public clouds (AWS, GCP, Azure) and with VM and container virtualization technologies such as VMware, KVM, Hyper-V, Docker, and Kubernetes.
- Experience debugging GPU performance issues, embedded device software development and automation, software driver development, and CUDA/TensorRT applications.
Benefits
Competitive salaries, equity, and a generous benefits package. NVIDIA is an equal opportunity employer committed to fostering diversity.
Base salary range is 168,000 USD - 333,500 USD, determined by location, experience, and pay of employees in similar positions.