Staff Backend Engineer - Databases Tempo

📍 Canada
📍 United States
CAD 186,400-223,600 per year
SENIOR
✅ Remote

Required Skills & Competences

Security @ 4, Go @ 7, Grafana @ 4, Kubernetes @ 4, Prometheus @ 4, SQL @ 4, Leadership @ 4, Communication @ 4, Parquet @ 4, Rust @ 7, API @ 4, Technical Leadership @ 4, LLM @ 4, OpenTelemetry @ 4, Observability @ 4, AI @ 4

Details

Grafana Labs is a remote-first, open-source company building observability tools used by millions of users and thousands of companies. Tempo is Grafana's open-source distributed tracing backend powering Grafana Cloud Traces and Grafana Enterprise Traces.

This is a remote role on the Tempo team, open to candidates in the US and Canada. It focuses on evolving Tempo from a SaaS database into a platform that powers Grafana’s next generation of observability products. The team is moving from foundational work to product and operational excellence and needs a Staff Engineer to set technical direction and deliver high-impact initiatives.

Responsibilities

  • Lead multi-quarter technical initiatives from problem framing through rollout (examples: trace aggregation APIs, Limitless Tempo, autoscaling cells, query engine improvements).
  • Own the architecture of core Tempo components: ingestion, storage, query, and metrics generation; drive design reviews and document design decisions.
  • Design APIs for human and agent consumption, shaping structured, deterministic, discoverable interfaces for downstream products and LLM-driven assistants.
  • Drive operational excellence against concrete SLOs (P99 write latency, incident recurrence, TCO per ingested GB) and reduce toil via automation, parameterized rollouts, and actionable alerts.
  • Partner with Product and sibling teams (App Observability, Asserts, Drilldown, Grafana Assistant) to understand consumption patterns and unblock downstream teams.
  • Mentor engineers through code review, design feedback, pairing, and clear technical writing.
  • Participate in on-call for services you help build and lead incident response and post-incident learning.
  • Contribute to the Tempo open-source project: engage the community, review external contributions, and help steer the project publicly.

Example problems you could work on

  • Trace aggregation and higher-density APIs: extend TraceQL metrics and design LLM-friendly response types.
  • Autoscaling end-to-end: customer limits, Tempo cells with hysteresis, predictive scaling, and safe scale-down.
  • Agent-scale ingestion and query: guardrails for bursty, high-cardinality, agent-generated workloads.
  • Query performance: new data formats, smarter query pipelines, and targeted optimizations for large query ranges.
  • Rollouts and multi-cell operations: parameterized rollouts, push-button deploys, and tooling to operate hundreds of cells safely.
  • Limits and self-service: customer-facing configuration and observability to minimize escalations.

Requirements

  • Technical leadership with a track record of leading complex, multi-quarter initiatives spanning design, delivery, and operations.
  • Substantial hands-on experience building and operating distributed data systems in production (ingestion pipelines, storage engines, query execution, or similar).
  • Strong software craftsmanship: writing clean, robust, maintainable, performant code and making pragmatic trade-offs.
  • Strong Go experience, or a clear path to becoming strong in Go (Tempo is written in Go). Deep experience in other systems languages (Rust, C, C++) translates well.
  • Operational mindset: ownership of production services, on-call experience, toil reduction, and treating SLOs as a product feature.
  • Customer focus and pragmatism: break complex problems into short feedback loops and iterate using MVPs.
  • Leadership through writing and collaboration in a fully remote, asynchronous environment.

Bonus Points

  • Experience with tracing, OpenTelemetry, or large-scale observability systems.
  • Experience designing query languages, SQL/TraceQL-like engines, or programmatically consumed APIs.
  • Experience with columnar storage formats (e.g., Parquet) or purpose-built on-disk formats for analytical workloads.
  • Experience operating multi-tenant, multi-cell SaaS infrastructure at scale on Kubernetes.
  • Experience building APIs for AI/LLM consumers: structured APIs, metadata/discovery endpoints, deterministic outputs, evaluation harnesses.
  • Open-source contribution or maintainership and comfort engaging a public community.
  • Experience as an on-call user of Grafana, Prometheus, Loki, or Tempo.
  • Experience in a fully remote, globally distributed team.

How We Work

  • Remote-first, asynchronous collaboration with regular video meetings. Emphasis on clear communication, design docs, and written decisions.
  • Use of modern AI coding assistants is supported (company-funded usage budget) subject to security guidelines; access to frontier models is available for prototyping and productivity.

Compensation & Benefits

  • Canada compensation range: $186,368 - $223,642 CAD per year. Actual compensation may vary based on level, experience, and skillset.
  • All roles include Restricted Stock Units (RSUs).
  • In-person onboarding and a global leave policy of 30 days per year (3 of which are reserved for Grafana Shutdown Days). Grafana Labs complies with local legislation where applicable.

Equal Opportunity

Grafana Labs is an equal opportunity employer and recruits, trains, compensates, and promotes regardless of protected characteristics. The company may use AI tools to assist in recruitment processing while continuing manual review of CVs.