Senior Site Reliability Engineer - Observability and Telemetry Platform

at NVIDIA

📍 Santa Clara, United States

USD 168,000-333,500 per year

SENIOR

✅ On-site

Required Skills & Competences

Docker @ 4, Go @ 4, Grafana @ 1, Kubernetes @ 4, Linux @ 4, Prometheus @ 1, Ruby @ 4, Python @ 4, Distributed Systems @ 6, Communication @ 7, Mathematics @ 4, Networking @ 4, OpenStack @ 4, Perl @ 4, SRE @ 4, OpenTelemetry @ 1

Details

Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline for designing, building and maintaining large-scale production systems with high efficiency and availability, using a combination of software and systems engineering practices. This role focuses on the operational and reliability aspects of observability and telemetry collection platforms: ensuring high availability and performance at scale, and enabling developers to make changes safely through automation and engineering practices.

Responsibilities

Design, implement and support operational and reliability aspects of large-scale Observability & Telemetry collection platforms with a focus on performance at scale, real-time monitoring, logging and alerting.
Engage in and improve the whole lifecycle of services — from inception and design through deployment, operation and refinement.
Support services before they go live via system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews.
Maintain services once live by measuring and monitoring availability, latency and overall system health.
Scale systems sustainably through automation and evolve systems by driving changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems.
Participate in an on-call rotation to support production systems.

Requirements

BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
5+ years of experience with infrastructure automation and distributed systems design; experience designing and developing tools for running large-scale private or public cloud systems in production.
8+ years of experience delivering foundational infrastructure and observability platforms.
Experience in one or more of Python, Go, Perl or Ruby.
In-depth knowledge of Linux, networking and containers.
Experience with cloud and large-scale systems (private or public), Kubernetes, OpenStack and Docker is valued.
Experience running or using observability tools such as Grafana, OpenTelemetry and Prometheus is a plus.
Strong problem-solving and communication skills, a sense of ownership, and the ability to debug and optimize code and automate routine tasks.

Ways to stand out

Interest in crafting, analyzing and fixing large-scale distributed systems.
Systematic problem-solving approach, coupled with strong communication skills and ownership.
Experience operating large private and public cloud systems based on Kubernetes, OpenStack and Docker, and experience with Grafana, OpenTelemetry, Prometheus or similar observability tools.

Compensation & Benefits

Base salary ranges:
- Level 4: 168,000 USD - 270,250 USD
- Level 5: 208,000 USD - 333,500 USD
You will also be eligible for equity and benefits (see NVIDIA benefits page).

Additional information

Applications for this job will be accepted at least until August 13, 2025.
NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.