Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others, and you know the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 4
Go @ 4
Grafana @ 4
Kafka @ 3
Kubernetes @ 4
Prometheus @ 3
IaC @ 4
Python @ 4
Distributed Systems @ 4
Leadership @ 4
Communication @ 7
Rust @ 4
Microservices @ 4
Debugging @ 4
Technical Leadership @ 4
Observability @ 4
AI @ 4
Details
Grafana Labs is a remote-first, open-source powerhouse with over 20M users of Grafana and customers across industries. The Databases department owns and operates the telemetry databases (Mimir for metrics, Loki for logs, Tempo for traces, and Pyroscope for profiles) and offers these as Cloud services supporting Grafana Cloud.
The Adaptive Telemetry group develops solutions (Adaptive Metrics, Adaptive Logs, Adaptive Traces, Adaptive Profiles) that help customers control and optimize telemetry data so only the most valuable data is retained. This is a remote position targeted at candidates in Canadian time zones.
Responsibilities
- Drive technical strategy and roadmap: define architectural vision, prioritize work that unlocks major product or platform improvements, and influence product and engineering decisions.
- Lead end-to-end delivery of large, cross-functional projects: own planning, design, execution, rollout and long-term operation.
- Own architecture, reliability, performance and cost for critical systems: make pragmatic architecture choices balancing scalability, availability, latency and cost while keeping systems maintainable and evolvable.
- Define SLOs/SLIs and lead incident response: establish measurable reliability targets, run high-severity incident response, lead blameless post-mortems, and drive systemic fixes and automation to prevent recurrence.
- Improve observability, automation and operational readiness: champion telemetry, alerting, runbooks, capacity planning and automation efforts that reduce toil, speed debugging and lower MTTR.
- Align stakeholders and remove blockers: coordinate across Product, Design and other teams to align priorities, negotiate tradeoffs, and unblock delivery for large initiatives.
- Mentor and grow engineering talent: coach senior and mid-level engineers, lead design reviews, raise engineering standards, and help teammates make sound technical tradeoffs.
- Represent engineering internally and externally: communicate technical strategy clearly to non-engineering stakeholders and represent the team in cross-team planning.
- Use modern AI coding assistants as part of your daily workflow (optional, and subject to security guidelines); the company provides a funded usage budget and access to frontier models.
Requirements
- Proven delivery of large distributed systems with evidence of technical leadership and impact.
- Strong systems-design instincts and deep understanding of tradeoffs around latency, consistency, availability, scaling and cost.
- Hands-on cloud and platform experience with cloud-native architectures (microservices, containers/Kubernetes, IaC) and operational practices to keep them healthy.
- Reliability and performance ownership: comfortable defining SLOs/SLIs, capacity planning, tuning performance, and driving reliability work end-to-end.
- Excellent coding and design skills; Grafana uses Go, and experience with Python, C, C++, Rust or similar languages translates well.
- Comfort with AI-assisted development and ideally practical experience integrating AI-powered developer tools into a team workflow.
- Experience with messaging and telemetry: familiarity with streaming/messaging systems (e.g., Kafka) and observability tooling (Prometheus/Grafana or equivalents).
- Strong communication skills and ability to influence without authority in a remote-first environment.
Compensation & Rewards / Benefits
- Base compensation range in Canada: CAD 186,368 - CAD 223,642 (actual compensation may vary based on level, experience, and skillset).
- Benefits include equity, a bonus (if applicable) and others listed by the company.
- 100% remote company with a global culture; in-person onboarding is provided.
- Global annual leave policy of 30 days per annum (subject to local legislation), including 3 Grafana Shutdown Days.
Additional Information
- The role is remote-first and specifically targets candidates in Canadian time zones.
- The company emphasizes open-source values, transparency, autonomy, and a collaborative culture.
- Grafana Labs is an Equal Opportunity Employer and may utilize AI tools in its recruitment process.
- Legal and privacy notes regarding applicant data are provided by the company.