Senior Software Engineer, Observability
Required Skills & Competences
Security @ 4, Go @ 4, Grafana @ 4, Kubernetes @ 4, Prometheus @ 4, Terraform @ 4, Python @ 4, Distributed Systems @ 4, Networking @ 4, Microservices @ 7, Thanos @ 4, Debugging @ 4, API @ 4, System Architecture @ 7, OpenTelemetry @ 4, GPU @ 4
Details
NVIDIA's Observability team is seeking a Senior/Staff Engineer to design and build the next-generation, multi-region observability platform. This platform powers NVIDIA's AI, Data, and Observability ecosystem at immense scale, processing trillions of metrics, hundreds of terabytes of logs, and billions of distributed traces daily across high-performance datacenters and multi-cloud environments. This is a high-impact, architecture-heavy, code-first role owning the unified observability stack (metrics, logs, traces, profiles, analytics). Ownership spans the entire telemetry pipeline: ingestion, storage, query routing, governance, multi-tenant isolation, GPU-accelerated analytics, and real-time insights.
You will collaborate across GPU Compute, Distributed Systems, Networking, ML Infra, AI Platform, and Cloud Services to shape NVIDIA's global observability strategy and ensure engineers have deep access to system health, performance, and debugging signals.
Responsibilities
- Design and operate scalable observability systems (metrics, logging, tracing) across multi-datacenter Kubernetes environments.
- Architect end-to-end observability pipelines including ingestion, storage, querying, and visualization.
- Extend monitoring and alerting using Prometheus, Alertmanager, Thanos/Mimir, Grafana, and OpenTelemetry.
- Build scalable log collection and processing pipelines with Fluent Bit, Vector, Loki, or ELK/OpenSearch stacks.
- Implement distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrate with service meshes, load balancers, and APIs.
- Define and drive adoption of SLOs, SLIs, and error budgets across services and teams.
- Automate provisioning and scaling of observability infrastructure with Kubernetes, Terraform, and custom tooling (Go, Python).
- Ensure reliability and cost efficiency of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure).
- Embed security guidelines into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls.
- Mentor engineers and help shape NVIDIA's observability strategy and technical roadmap.
Requirements
- 8+ years of experience with distributed systems, focusing on observability and monitoring systems.
- BS or MS in EE, ECE, CS, or equivalent experience.
- Deep expertise with metrics systems (Prometheus, Thanos, Mimir, Cortex), logging pipelines (Fluent Bit, Vector, Loki, ELK/OpenSearch), and tracing platforms (Jaeger, Tempo, OpenTelemetry).
- Strong programming skills in Go or Python for automation, operators, and custom integrations.
- Experience running observability platforms on Kubernetes and operating them at scale across multi-datacenter environments.
- Demonstrated skill in designing, optimizing, and scaling telemetry pipelines that handle high-cardinality, high-throughput data.
- Solid understanding of distributed systems, performance engineering, and debugging complex workloads.
- Familiarity with service meshes, networking, and workload instrumentation (Envoy, Istio, OpenTelemetry SDKs).
- Collaboration skills and the ability to influence engineering teams to adopt observability guidelines.
Ways to stand out
- Strong background in software design and system architecture with excellent coding skills in Go and Python for building telemetry pipelines and microservices.
- Experience operating large-scale, multi-region observability platforms (metrics, logs, traces).
- Proven track record meeting strict SLAs for high-throughput, high-cardinality distributed systems.
- Deep expertise in data platforms and telemetry ingestion, storage, and querying layers.
- Demonstrated success collaborating across teams to drive platform improvements and standardization.
Compensation & Benefits
- Base salary range:
- Level 4: 184,000 USD – 287,500 USD
- Level 5: 224,000 USD – 356,500 USD
- Eligible for equity and company benefits.
Additional information
- Applications accepted at least until December 12, 2025.
- NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.