Senior Site Reliability Engineer, Observability
at Nvidia
Santa Clara, United States
USD 184,000-356,500 per year
Required Skills & Competences
ElasticSearch @ 7, Go @ 1, Grafana @ 4, Kafka @ 4, Linux @ 4, Prometheus @ 7, Python @ 1, Spark @ 4, Java @ 1, Distributed Systems @ 4, Flink @ 4, Hiring @ 4, Networking @ 4, SRE @ 8, Thanos @ 7, Debugging @ 4, OpenTelemetry @ 7
Details
NVIDIA sits at the center of the AI revolution, and the teams behind our data and observability platforms keep the whole engine running. We are hiring Site Reliability Engineers to design and run NVIDIA's global telemetry backbone: the platform that carries metrics, logs, traces, and profiling data for demanding workloads. You will shape how AI and data systems are built, set reliability standards, and solve scaling challenges that come with operating at NVIDIA's pace and scale.
Responsibilities
- Architect and operate large-scale observability systems that span global regions and support AI, data, and platform services.
- Design resilient pipelines for metrics, logs, traces, profiling, and events to keep critical systems visible and debuggable.
- Partner with platform, infrastructure, and application teams to establish telemetry standards, instrumentation patterns, and integration workflows.
- Automate deployments, scaling workflows, and maintenance tasks to reduce toil and raise operational maturity across the stack.
- Define and maintain SLOs, SLIs, error budgets, dashboards, and alerting models to guide reliability decisions company-wide.
- Build self-service tooling and frameworks that make observability easy to adopt for engineers across NVIDIA.
- Study system behavior to uncover bottlenecks, scaling limits, failure modes, and long-term architecture risks.
- Run day-to-day operations including upgrades, performance tuning, break/fix, and rotations to keep the platform healthy.
- Lead incident response and root-cause investigations and drive remediation to prevent recurrence.
- Guide engineers through design reviews, operational best practices, and reliability-focused decision making.
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
- 10+ years operating large-scale production systems in roles such as SRE, Production Engineer, or Platform Engineer.
- 5+ years designing, building, and running observability platforms at scale.
- Deep hands-on experience with open-source observability stacks (Prometheus/Thanos/Mimir for metrics; Loki or Elasticsearch/OpenSearch for logs; Tempo/Jaeger/OpenTelemetry for tracing and profiling).
- Strong programming ability in Python and Go; Java experience is a plus.
- Solid grounding in Linux internals, networking, storage systems, distributed systems, concurrency, and performance engineering.
- Experience architecting multi-region, multi-tenant telemetry pipelines with high availability and durability guarantees.
- Proven skill optimizing PromQL, LogQL, trace queries, ingestion paths, indexing strategies, and retention policies.
- Strong understanding of SLOs, SLIs, error budgets, incident response, and operational processes that support reliable systems.
- Ability to analyze complex distributed systems, pinpoint failure modes, and drive data-informed debugging and root-cause analysis.
- Clear communicator able to collaborate across product, platform, infrastructure, and application engineering teams.
Ways to Stand Out
- Designed or led the architecture of a global observability platform supporting thousands of services with strict reliability and performance requirements.
- Contributed meaningfully to OpenTelemetry, Prometheus, Grafana, or other major observability open-source projects.
- Built high-throughput ingestion pipelines and long-term storage systems with focus on cost efficiency, retention strategy, and query performance.
- Specialized in high-cardinality telemetry, multi-tenant isolation, and advanced retention or tiered storage models.
- Hands-on experience with Kafka, Spark, Flink, or large-scale collectors in ultra-high-scale production environments.
Compensation & Benefits
- Base salary range (determined by location, experience, and pay of employees in similar positions):
  - Level 4: 184,000 USD - 287,500 USD
  - Level 5: 224,000 USD - 356,500 USD
- You will also be eligible for equity and benefits (see NVIDIA benefits page).
Additional Information
- Applications for this job will be accepted at least until January 20, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer.