Senior Site Reliability Engineer

at NVIDIA

📍 Santa Clara, United States

USD 168,000-333,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Ansible @ 4, Chef @ 4, Docker @ 4, Go @ 4, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, Terraform @ 4, Python @ 4, GCP @ 4, Java @ 4, AWS @ 4, Azure @ 4, Git @ 4, Networking @ 4, SRE @ 4, Android @ 4, CUDA @ 4

Details

NVIDIA is looking for a Senior Site Reliability Engineer to work in IPP (Infrastructure, Planning and Process), a global organization within NVIDIA. The group works with other groups across NVIDIA Software, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence and Driverless Cars, to meet their infrastructure needs. Its cloud services run almost half a million automated jobs per day on thousands of servers, supporting the productivity of thousands of NVIDIA's software engineers worldwide.

The cloud hosts a heterogeneous mix of machines and devices running various operating systems (Windows / Linux / Android) on a multitude of hardware platforms, spanning both NVIDIA GPUs and Tegra processors. The role focuses on building and maintaining cloud services, designing creative solutions, and mining operational data to uncover and fix problems, improving developer productivity and infrastructure reliability.

Responsibilities

Develop frameworks and scripts to automate workflows and deployments in the cloud environment.
Deploy and maintain a large farm of machines using Configuration Management & Infrastructure Automation tools (Chef, Ansible, Terraform).
Develop extensive monitoring systems to provide fast, reliable, real-time visibility of infrastructure subsystems (Zabbix, Grafana, Prometheus).
Participate in on-call and rotational L1 support for round-the-clock monitoring and remediation of the infrastructure.
Solve complex problems involving infrastructure scaling, capacity planning, and resiliency. Analyze and debug operating system, networking, configuration and performance issues.
Assist in rollout and deployment of new development features to support the latest NVIDIA hardware and technologies.
Develop SRE agents to streamline daily operational activities, reduce toil and improve operational efficiency.

Requirements

Bachelor's or Master's degree in Computer Science or Software Engineering, or equivalent experience.
8+ years of demonstrable operational experience working in large-scale enterprise production systems.
Familiarity with implementing load balancing strategies, disaster recovery planning, business continuity best practices, and designing scalable, resilient systems based on SRE principles.
Ability to debug and analyze source code to triage, root-cause and resolve infrastructure issues, and to work with development teams to improve build and test systems.
Hands-on coding experience with Python and/or Go. Proficiency with Unix shell scripting. Knowledge of Java and C.
Experience with version control systems such as Perforce and Git.

Ways to stand out from the crowd

Experience with public clouds (AWS, GCP, Azure) and VM/container virtualization technologies such as VMware, KVM, Docker and Kubernetes.
Background with automating bare metal and VM provisioning.
Experience with supporting GPUs, embedded device development, driver development and CUDA / TensorRT applications.

Compensation & Benefits

Base salary ranges quoted by level:
- Level 4: 168,000 USD - 270,250 USD
- Level 5: 208,000 USD - 333,500 USD
You will also be eligible for equity and benefits.

Other information

Applications for this job will be accepted at least until September 14, 2025.
NVIDIA is an equal opportunity employer and values diversity in its workforce.