Senior Site Reliability Engineer

at NVIDIA
USD 148,000-276,000 per year
SENIOR
✅ On-site

Used Tools & Technologies

GPU

Required Skills & Competences

Security @ 7, Ansible @ 7, Docker @ 6, Grafana @ 4, Jenkins @ 7, Kubernetes @ 4, MySQL @ 4, Prometheus @ 4, Kibana @ 4, SQL @ 4, Distributed Systems @ 4, Communication @ 4, Networking @ 4, OpenStack @ 4, SRE @ 4, Planning @ 4, Splunk @ 4, Deep Learning @ 4, AI @ 4

Details

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today, NVIDIA is tapping into the unlimited potential of AI to define the next era of computing. As an NVIDIAN, you will be immersed in a diverse, supportive environment working on infrastructure that supports GPUs, Tegra systems, deep learning, AI and driverless cars.

Responsibilities

  • Manage NVIDIA's on-prem infrastructure across multiple data centers to maintain uptime, reliability, and readiness of engineering cloud environments.
  • Guard service level agreements (SLAs) for critical engineering services by implementing monitoring, alerting, and incident response procedures.
  • Perform root cause analysis and post-mortems for incidents and threshold breaches.
  • Deploy, configure, and manage applications and services on Kubernetes clusters; ensure high availability, fault tolerance, and disaster recovery for Kubernetes workloads.
  • Implement logging, monitoring, and alerting solutions (for example, Prometheus, Grafana, ELK/EFK).
  • Drive automation of monitoring and operations to improve insight into application and system health.
  • Support user-reported issues, monitor alerts, participate in war rooms for critical incidents, and assist in capacity planning and optimization efforts.
  • Apply AI techniques to extract useful signals from machine and job data where applicable.

Requirements

  • 5+ years of demonstrable experience maintaining cloud infrastructure and highly available production environments.
  • Experience handling and maintaining systems installed in on-premises data centers; hands-on proficiency with BMC interfaces (Redfish), KVM, and IPMI for hardware provisioning, remote access, and troubleshooting.
  • Knowledge of OpenStack architecture and services is a plus.
  • Experience with databases including relational (SQL/MySQL) and time-series databases (Prometheus); experience in data querying and performance tuning.
  • Solid understanding of networking principles and protocols (TCP/IP, DNS, DHCP, VLANs) and diagnosing connectivity in distributed systems.
  • Practical experience with data analytics and visualization tools such as Kibana, Grafana, Splunk, or similar platforms for logs and metrics analysis.
  • Strong experience with automation tools like Jenkins and/or Temporal and configuration tools like Ansible.
  • Proficiency with Kubernetes, Docker, and virtualization technologies for deploying and operating containerized workloads in production.
  • Advanced knowledge of standard security methodologies and protocols, including system hardening, access control, vulnerability management, and secure operations.
  • Bachelor’s degree in Computer Science, Information Technology, or related field, or equivalent experience.

Ways to stand out

  • Prior experience with SRE teams managing on-prem infrastructure.
  • Experience managing NVIDIA hardware such as GPUs and Tegra devices.
  • Excellent interpersonal and communication skills, and the ability to thrive in a multitasking environment with evolving priorities.

Compensation and benefits

  • Base salary ranges by level:
    • Level 3: 148,000 USD - 235,750 USD
    • Level 4: 176,000 USD - 276,000 USD
  • Eligible for equity and benefits; details of the benefits package are on NVIDIA's benefits page.

Additional information

  • Applications are accepted until at least June 13, 2026.
  • This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.