Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and know the pitfalls and tricks;
- 10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 4
Go @ 7
Grafana @ 4
Kubernetes @ 4
Prometheus @ 4
SQL @ 4
Leadership @ 4
Communication @ 4
Parquet @ 4
Rust @ 7
API @ 4
Technical Leadership @ 4
LLM @ 4
OpenTelemetry @ 4
Observability @ 4
AI @ 4
Details
Grafana Labs is a remote-first, open-source company building observability tooling used globally. Tempo is the open-source distributed tracing backend behind Grafana Cloud Traces and Grafana Enterprise Traces (GET). This is a remote position; Grafana Labs is seeking candidates in the United States and Canada.
Responsibilities
- Lead multi-quarter technical initiatives from problem framing through rollout (e.g., trace aggregation APIs, Limitless Tempo, autoscaling cells, query engine improvements).
- Own the architecture of core Tempo components: ingestion, storage, query, and metrics generation. Drive design reviews and document design decisions.
- Design APIs for humans and agents; shape next-generation Tempo interfaces (structured, deterministic, discoverable) for downstream products and LLM-driven assistants.
- Drive operational excellence: own outcomes against concrete SLOs (P99 write latency, incident recurrence, TCO per ingested GB), reduce toil, and push toward Zero Ops through automation and parameterized rollouts.
- Partner with Product and sibling teams (App Observability, Asserts, Drilldown, Grafana Assistant) to understand consumption patterns and unblock product teams.
- Mentor engineers through code review, design feedback, pairing, and clear technical writing.
- Participate in on-call for services you help build; be a force multiplier in incident response and post-incident learning.
- Contribute to the Tempo open-source project: engage the community, review external contributions, and help steer the project in the open.
Requirements
- Technical leadership with a track record of leading complex, multi-quarter initiatives spanning design, delivery, and operations.
- Deep, hands-on experience building and operating distributed data systems in production (ingestion pipelines, storage engines, query execution, or similar).
- Strong software craftsmanship; ability to write clean, robust, performant, and maintainable code.
- Strong Go experience, or the ability and path to become proficient in Go (Tempo is written in Go). Deep experience in other systems languages (Rust, C, C++) translates well.
- Operational mindset: experience owning production services, carrying a pager, reducing toil, and treating SLOs as a product feature.
- Customer focus and pragmatism: ability to break problems into short feedback loops (MVP -> learn -> iterate).
- Leadership through writing and collaboration: lead via design docs, reviews, and shipped code in a fully remote, asynchronous environment.
Bonus (nice-to-have)
- Experience with tracing, OpenTelemetry, or large-scale observability systems.
- Experience designing query languages, SQL/TraceQL-like engines, or programmatic APIs consumed by services/agents.
- Experience with columnar storage formats (e.g., Parquet) or purpose-built on-disk formats for analytical workloads.
- Experience operating multi-tenant, multi-cell SaaS infrastructure at scale on Kubernetes.
- Experience building for AI/LLM consumers: structured APIs, metadata/discovery endpoints, deterministic outputs, evaluation harnesses.
- Open-source contribution or maintainership and comfort engaging a community in the open.
- Experience as an on-call user of Grafana, Prometheus, Loki, or Tempo.
- Experience in fully remote, globally distributed teams.
How we work
We are a remote-first team that works asynchronously and values clear written communication. The team invests heavily in developer productivity and allows the use of modern AI coding assistants within security guidelines. Tempo is relied upon by global organizations to monitor critical applications and infrastructure; staff engineers are expected to contribute to reliability, usability, and product direction.
Compensation & Benefits
- In the United States, the compensation range for this role is $174,986 - $209,983 USD (actual compensation may vary by level, experience, and skillset).
- All roles include Restricted Stock Units (RSUs).
- Remote-first company culture; in-person onboarding is provided.
- Global annual leave policy: 30 days per annum (3 of these days are reserved for Grafana Shutdown Days; local legislation is followed where applicable).
Equal Opportunity & Privacy
Grafana Labs is an equal opportunity employer and recruits, trains, compensates, and promotes regardless of protected characteristics. The company may use AI tools in recruitment to help match CVs to postings; inbound CVs are manually reviewed. For details about personal data use, candidates are referred to Grafana's applicant privacy policy.