Used Tools & Technologies
Not specified
Required Skills & Competences
ElasticSearch @ 3, Go @ 3, Kafka @ 3, Prometheus @ 3, Python @ 3, Spark @ 3, Distributed Systems @ 3, Flink @ 3, Data Science @ 3, Communication @ 6, Mentoring @ 3, Thanos @ 3, API @ 3, OpenTelemetry @ 3
Details
At NVIDIA, the data science platform team builds and operates a high-scale observability foundation that carries metrics, logs, traces, profiles, and events used to understand and debug services. This Engineering Manager role stays close to technology: guiding architecture decisions, reviewing designs and code, and helping engineers solve distributed-systems challenges related to telemetry ingestion, storage, querying, and multi-region data flows.
Responsibilities
- Lead a team of engineers who design and build core services, pipelines, and storage layers for NVIDIA’s global observability platform.
- Create clear technical direction emphasizing simplicity, performance, and maintainability.
- Define architecture for distributed ingestion services, time-series storage, log and trace pipelines, query paths, and multi-region data flows.
- Partner with platform, infrastructure, and application teams to define data models, instrumentation patterns, APIs, and integration standards.
- Strengthen engineering practices via better tooling, automated tests, schema management, API versioning, documentation, and safe rollout processes.
- Help engineers solve distributed-systems issues including ingestion load, indexing pressure, compaction behavior, query fan-out, and replication patterns.
- Drive predictable execution through clear priorities, collaborative planning, and alignment across teams.
- Represent the observability platform across NVIDIA, gather feedback, and evolve the system to support future AI workloads.
Requirements
- Bachelor’s or Master’s degree in Computer Science or a related technical field (or equivalent experience).
- 8+ years of overall experience building distributed systems, with a focus on observability and monitoring systems, and 3+ years managing or leading engineers.
- Experience with modern observability stacks such as Prometheus, Thanos, Mimir, Loki, OpenSearch, Jaeger, Tempo, or OpenTelemetry (or equivalent experience).
- Strong foundations in distributed systems concepts including replication, sharding, durability, consensus, and performance tuning.
- Hands-on experience designing or scaling ingestion pipelines, time-series engines, trace backends, or log indexing systems, especially in high-cardinality environments.
- Ability to read and review Go or Python code and support engineers through technical decision-making.
- Clear architectural thinking with a focus on stable APIs, predictable performance, and long-term evolution.
- Experience mentoring engineers, improving technical judgment, and contributing to an inclusive engineering culture.
- Strong communication skills and the ability to explain complex challenges with clarity.
Ways to stand out
- Experience building or contributing to an observability or telemetry platform used at significant scale.
- Contributions to open-source projects such as OpenTelemetry, Prometheus, Loki, Thanos, Tempo, Jaeger, ClickHouse, Mimir, or Elasticsearch.
- Experience with high-throughput systems like Kafka, Flink, or Spark, or large-scale data collectors.
- Deep knowledge of cardinality management, query performance, storage design, or retention optimization.
- Experience designing multi-region architectures with emphasis on consistency, availability, and data locality.
Compensation and benefits
- Base salary range: 224,000 USD - 356,500 USD (determined based on location, experience, and pay of employees in similar positions).
- Eligibility for equity and company benefits (link to NVIDIA benefits referenced in original posting).
About NVIDIA
NVIDIA leads developments in Artificial Intelligence, High-Performance Computing, and Visualization. The company values diversity and is an equal opportunity employer.
Application deadline
Applications for this job will be accepted at least until January 11, 2026.