Senior Site Reliability Engineer - Observability And Telemetry Platform

at Nvidia
USD 144,000-270,250 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Docker @ 4, Go @ 4, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, Ruby @ 4, Python @ 4, Distributed Systems @ 6, Communication @ 7, Networking @ 4, OpenStack @ 4, Perl @ 4, SRE @ 4, Debugging @ 4, OpenTelemetry @ 4, GPU @ 4

Details

Site Reliability Engineering (SRE) at NVIDIA involves designing, building, and maintaining large-scale production systems with high efficiency and availability by applying software and systems engineering practices. The discipline requires expertise in systems, networking, coding, databases, capacity management, continuous delivery and deployment, and cloud technologies such as Kubernetes and OpenStack. SRE ensures that GPU cloud services achieve maximum reliability and uptime while enabling developers to update the system safely.

Much of the work focuses on automation, performance tuning, and optimization to reduce manual work and improve system efficiency. The team values diversity, intellectual curiosity, problem-solving, open collaboration, and a blame-free culture that encourages learning and growth.

Responsibilities

  • Design, implement, and support a large-scale Observability & Telemetry collection platform focusing on performance at scale, real-time monitoring, logging, and alerting.
  • Improve the entire service lifecycle, from inception through design, deployment, and operation to refinement.
  • Provide design consulting, develop software tools and platforms, manage capacity, and conduct launch reviews to support services before they go live.
  • Monitor and maintain live services by measuring availability, latency, and overall system health.
  • Scale systems sustainably with automation and push for improvements in reliability and velocity.
  • Participate in sustainable incident response and blameless postmortems.
  • Participate in on-call rotations for production system support.

Requirements

  • BS degree in Computer Science or a related technical field, or equivalent experience.
  • 5+ years of experience in infrastructure automation, distributed systems design, and developing tools for large-scale private or public cloud production systems.
  • 5+ years delivering foundational infrastructure and observability platforms.
  • Experience with Python, Go, Perl, or Ruby.
  • In-depth knowledge of Linux, networking, and containers.

Ways To Stand Out

  • Interest in analyzing, crafting, and fixing large-scale distributed systems.
  • Systematic problem-solving skills, strong communication, ownership, and drive.
  • Experience debugging, optimizing code, and automating routine tasks.
  • Experience with Kubernetes, OpenStack, Docker, and observability tools such as Grafana, OpenTelemetry, and Prometheus.

Compensation

  • Base salary range for Level 3: $144,000 - $230,000 USD.
  • Base salary range for Level 4: $168,000 - $270,250 USD.
  • Eligibility for equity and benefits.

NVIDIA fosters a diverse and inclusive work environment and is committed to equal opportunity employment.