Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Docker @ 4
Go @ 7
Grafana @ 4
Kubernetes @ 4
DevOps @ 7
Python @ 7
Distributed Systems @ 4
Leadership @ 4
AWS @ 4
Communication @ 4
JavaScript @ 4
SRE @ 7
Prioritization @ 4
Reporting @ 4
Observability @ 4
AI @ 4
Change Management @ 4
Details
Grafana Labs is a remote-first, open-source company that builds observability tooling (Grafana, Mimir, Loki, Tempo) and managed offerings via Grafana Cloud. This role is part of the Grafana Cloud k6 squad, which builds and operates the performance testing product (Grafana k6, Grafana Cloud k6, Grafana Cloud Synthetics). The team runs distributed load tests from many regions, ingesting large volumes of test data used to view, correlate, and analyze metrics.
This opportunity focuses on establishing and scaling a cross-team culture of engineering excellence by setting standards and guiding adoption of strong DevOps/SRE practices to improve reliability, availability, and operational ownership. As the reliability foundation matures, the role expands into broader application and product development leadership.
Responsibilities
- Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability.
- Drive mature DevOps/SRE practices, including incident response and PIRs, on-call readiness, runbooks, alerting, observability, and release/change management.
- Establish reliability frameworks such as SLIs/SLOs and error budgets, and use them to guide prioritization and engineering trade-offs.
- Provide visibility into system health through clear operational metrics and reliability reporting.
- Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems.
- Influence product and system direction through design reviews, architectural discussions, and cross-team collaboration.
- Share knowledge through clear, high-quality documentation and technical communication—internally and, where appropriate, externally.
- Grow into broader application and product development leadership, contributing architectural and technical depth beyond operations.
- Use modern AI coding assistants as part of the daily workflow where appropriate (company provides usage budget and access to frontier models).
Requirements
- Strong experience with DevOps/SRE practices, including operating and evolving production systems at scale.
- Strong programming background in a modern language (Python and Go are the team's primary languages, though prior experience with them is not strictly required).
- Experience designing, building, and operating large-scale distributed systems.
- Strong understanding of reliability engineering concepts (incident management, observability, failure modes).
- Experience with test automation, including performance and functional testing.
- Ability to influence engineering practices through clear technical communication, reviews, and collaboration.
- Strong interpersonal skills and ability to work effectively across teams.
- Familiarity with modern software engineering processes and delivery practices.
- Self-driven and comfortable operating with a high degree of autonomy and ambiguity.
Preferred / Bonus Qualifications
- Experience with containerized and cloud-native systems (Docker, Kubernetes, AWS).
- Familiarity with observability tooling and platforms (for example, the Grafana stack).
- Experience working with Python, Go, JavaScript and/or Jsonnet.
- Experience building or operating event-driven or asynchronous systems.
- Experience defining or applying SLIs/SLOs, error budgets, or reliability metrics.
- Interest in, or experience with, building testing frameworks or developer tooling.
Compensation & Benefits
- In Canada, the base compensation range for this role is CAD 186,368 - CAD 223,642. Actual compensation may vary based on level, experience, and skillset as assessed in the interview process. The package also includes equity, a bonus (if applicable), and other benefits listed on Grafana's careers page.
- 100% remote company with in-person onboarding and a global annual leave policy (30 days per annum, with 3 days reserved for Grafana Shutdown Days; local legislation applied where applicable).
Location & Working Arrangements
- This is a remote role posted for Canada (Remote); applicants in Canadian time zones are especially encouraged to apply.
Equal Opportunity & Recruitment Notes
- Grafana Labs is an equal opportunity employer.
- The company may utilize AI tools in its recruitment process to assist in matching information provided in CVs to job postings; inbound CVs are also reviewed manually.