Engineering Manager, Observability Platform

at NVIDIA
USD 224,000-356,500 per year
MIDDLE
On-site

Required Skills & Competences

Elasticsearch @ 3, Go @ 3, Kafka @ 3, Prometheus @ 3, Python @ 3, Spark @ 3, Distributed Systems @ 3, Flink @ 3, Data Science @ 3, Communication @ 6, Mentoring @ 3, Thanos @ 3, API @ 3, OpenTelemetry @ 3

Details

At NVIDIA, the data science platform team builds and operates a large-scale observability foundation that carries the metrics, logs, traces, profiles, and events used to understand and debug services. This Engineering Manager role stays close to the technology: guiding architecture decisions, reviewing designs and code, and helping engineers solve distributed-systems challenges in telemetry ingestion, storage, querying, and multi-region data flows.

Responsibilities

  • Lead a team of engineers who design and build core services, pipelines, and storage layers for NVIDIA’s global observability platform.
  • Set clear technical direction that emphasizes simplicity, performance, and maintainability.
  • Define architecture for distributed ingestion services, time-series storage, log and trace pipelines, query paths, and multi-region data flows.
  • Partner with platform, infrastructure, and application teams to define data models, instrumentation patterns, APIs, and integration standards.
  • Strengthen engineering practices via better tooling, automated tests, schema management, API versioning, documentation, and safe rollout processes.
  • Help engineers solve distributed-systems issues including ingestion load, indexing pressure, compaction behavior, query fan-out, and replication patterns.
  • Drive predictable execution through clear priorities, collaborative planning, and alignment across teams.
  • Represent the observability platform across NVIDIA, gather feedback, and evolve the system to support future AI workloads.

Requirements

  • Bachelor’s or Master’s degree in Computer Science or a related technical field (or equivalent experience).
  • 8+ years overall building distributed systems, with a focus on observability and monitoring systems, and 3+ years managing or leading engineers.
  • Experience with modern observability stacks such as Prometheus, Thanos, Mimir, Loki, OpenSearch, Jaeger, Tempo, or OpenTelemetry (or equivalent experience).
  • Strong foundations in distributed systems concepts including replication, sharding, durability, consensus, and performance tuning.
  • Hands-on experience designing or scaling ingestion pipelines, time-series engines, trace backends, or log indexing systems, especially in high-cardinality environments.
  • Ability to read and review Go or Python code and support engineers through technical decision-making.
  • Clear architectural thinking with a focus on stable APIs, predictable performance, and long-term evolution.
  • Experience mentoring engineers, improving technical judgment, and contributing to an inclusive engineering culture.
  • Strong communication skills and the ability to explain complex challenges with clarity.

Ways to stand out

  • Experience building or contributing to an observability or telemetry platform used at significant scale.
  • Contributions to open-source projects such as OpenTelemetry, Prometheus, Loki, Thanos, Tempo, Jaeger, ClickHouse, Mimir, or Elasticsearch.
  • Experience with high-throughput systems like Kafka, Flink, or Spark, or large-scale data collectors.
  • Deep knowledge of cardinality management, query performance, storage design, or retention optimization.
  • Experience designing multi-region architectures with emphasis on consistency, availability, and data locality.

Compensation and benefits

  • Base salary range: 224,000 to 356,500 USD per year (determined based on location, experience, and the pay of employees in similar positions).
  • Eligibility for equity and NVIDIA company benefits.

About NVIDIA

NVIDIA is a leader in artificial intelligence, high-performance computing, and visualization. The company values diversity and is an equal opportunity employer.

Application deadline

Applications for this job will be accepted at least until January 11, 2026.