Engineering Manager, Observability Platform

at NVIDIA

📍 Santa Clara, United States

USD 224,000-356,500 per year

MIDDLE

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Elasticsearch @ 3, Go @ 3, Kafka @ 3, Prometheus @ 3, Python @ 3, Spark @ 3, Distributed Systems @ 3, Flink @ 3, Data Science @ 3, Communication @ 6, Mentoring @ 3, Thanos @ 3, API @ 3, OpenTelemetry @ 3

Details

At NVIDIA, the data science platform team builds and operates a high-scale observability foundation that carries metrics, logs, traces, profiles, and events used to understand and debug services. This Engineering Manager role stays close to technology: guiding architecture decisions, reviewing designs and code, and helping engineers solve distributed-systems challenges related to telemetry ingestion, storage, querying, and multi-region data flows.

Responsibilities

Lead a team of engineers who design and build core services, pipelines, and storage layers for NVIDIA’s global observability platform.
Create clear technical direction emphasizing simplicity, performance, and maintainability.
Define architecture for distributed ingestion services, time-series storage, log and trace pipelines, query paths, and multi-region data flows.
Partner with platform, infrastructure, and application teams to define data models, instrumentation patterns, APIs, and integration standards.
Strengthen engineering practices via better tooling, automated tests, schema management, API versioning, documentation, and safe rollout processes.
Help engineers solve distributed-systems issues including ingestion load, indexing pressure, compaction behavior, query fan-out, and replication patterns.
Drive predictable execution through clear priorities, collaborative planning, and alignment across teams.
Represent the observability platform across NVIDIA, gather feedback, and evolve the system to support future AI workloads.

Requirements

Bachelor’s or Master’s degree in Computer Science or a related technical field (or equivalent experience).
8+ years of experience building distributed systems, with a focus on observability and monitoring systems, including 3+ years managing or leading engineers.
Experience with modern observability stacks such as Prometheus, Thanos, Mimir, Loki, OpenSearch, Jaeger, Tempo, or OpenTelemetry (or equivalent experience).
Strong foundations in distributed systems concepts including replication, sharding, durability, consensus, and performance tuning.
Hands-on experience designing or scaling ingestion pipelines, time-series engines, trace backends, or log indexing systems, especially in high-cardinality environments.
Ability to read and review Go or Python code and support engineers through technical decision-making.
Clear architectural thinking with a focus on stable APIs, predictable performance, and long-term evolution.
Experience mentoring engineers, improving technical judgment, and contributing to an inclusive engineering culture.
Strong communication skills and the ability to explain complex challenges with clarity.

Ways to stand out

Experience building or contributing to an observability or telemetry platform used at significant scale.
Contributions to open-source projects such as OpenTelemetry, Prometheus, Loki, Thanos, Tempo, Jaeger, ClickHouse, Mimir, or Elasticsearch.
Experience with high-throughput systems like Kafka, Flink, or Spark, or large-scale data collectors.
Deep knowledge of cardinality management, query performance, storage design, or retention optimization.
Experience designing multi-region architectures with emphasis on consistency, availability, and data locality.

Compensation and benefits

Base salary range: 224,000 USD - 356,500 USD (determined based on location, experience, and pay of employees in similar positions).
Eligibility for equity and company benefits (link to NVIDIA benefits referenced in original posting).

About NVIDIA

NVIDIA leads developments in Artificial Intelligence, High-Performance Computing, and Visualization. The company values diversity and is an equal opportunity employer.

Application deadline

Applications for this job will be accepted at least until January 11, 2026.