Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level value.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. Able to teach others and familiar with all the pitfalls and tricks;
- 10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Grafana @ 4
Debugging @ 4
Observability @ 4
AI @ 4
Details
Grafana Labs is a remote-first, open-source company that builds Grafana and the Grafana Cloud observability platform. The Databases department owns and operates telemetry databases, including Mimir (metrics), Loki (logs), Tempo (traces) and Pyroscope (profiles). The Adaptive Telemetry group builds Adaptive Metrics, Adaptive Logs, Adaptive Traces and Adaptive Profiles to help customers control and optimize telemetry data retention and cost.
This is a remote position targeting candidates in United States time zones.
Responsibilities
- Drive technical strategy and roadmap: define architectural vision, prioritize work that unlocks major product or platform improvements, and influence product and engineering decisions.
- Lead end-to-end delivery of large, cross-functional projects: own planning, design, execution, rollout and long-term operation.
- Own architecture, reliability, performance and cost for critical systems: make pragmatic architecture choices balancing scalability, availability, latency and cost while keeping systems maintainable and evolvable.
- Define SLOs/SLIs and lead incident response: establish measurable reliability targets, run high-severity incident response, lead blameless post-mortems, and drive systemic fixes and automation to prevent recurrence.
- Improve observability, automation and operational readiness: champion telemetry, alerting, runbooks, capacity planning and automation efforts that reduce toil, speed debugging and lower MTTR.
- Align stakeholders and remove blockers: coordinate across Product, Design and other teams to align priorities and unblock delivery for large initiatives.
- Mentor and grow engineering talent: coach senior and mid-level engineers, lead design reviews, raise engineering standards, and help teammates make sound technical tradeoffs.
- Represent engineering internally and externally: communicate technical strategy clearly to non-engineering stakeholders and represent the team in cross-team planning.
- Use modern AI coding assistants as part of the workflow where appropriate (the company provides a usage budget and access to frontier models).