Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. Able to teach others and familiar with the pitfalls and tricks;
- 10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 4
Docker @ 4
Go @ 7
Grafana @ 4
Kubernetes @ 4
DevOps @ 7
Python @ 7
Distributed Systems @ 4
Leadership @ 7
AWS @ 4
Communication @ 4
JavaScript @ 4
SRE @ 7
Prioritization @ 4
Reporting @ 4
Observability @ 4
AI @ 4
Change Management @ 4
Details
Grafana Labs is a remote-first, open-source company building observability and performance testing products. This opportunity is with the Grafana Cloud k6 squad, which builds and operates Grafana Cloud k6 and Grafana Cloud Synthetics. The performance testing product is used to run distributed tests from many regions and ingest large volumes of test data. This is a remote opportunity; applicants in the Ireland time zone are preferred. The role focuses on establishing and scaling a cross-team culture of engineering excellence and strong DevOps/SRE practices, and is expected to grow into broader application and product development leadership over time.
Responsibilities
- Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability.
- Drive DevOps/SRE practices including incident response and post-incident reviews (PIRs), on-call readiness, runbooks, alerting, observability, and release/change management.
- Establish reliability frameworks such as SLIs, SLOs, and error budgets and use them to guide prioritization and engineering trade-offs.
- Provide visibility into system health through operational metrics and reliability reporting.
- Guide teams in design, development, evolution, and operation of large-scale distributed cloud systems.
- Influence product and system direction through design reviews, architectural discussions, and cross-team collaboration.
- Share knowledge via clear, high-quality documentation and technical communication, internally and, when appropriate, externally.
- As the reliability foundation matures, expand into broader application and product development leadership.
Requirements
- Strong experience with DevOps/SRE practices, including operating and evolving production systems at scale.
- Strong programming background in a modern language. Python and Go are the primary languages used; prior experience in those exact languages is not strictly required.
- Experience designing, building, and operating large-scale distributed systems.
- Strong understanding of reliability engineering concepts (incident management, observability, failure modes).
- Experience with test automation, including performance and functional testing.
- Ability to influence engineering practices through clear technical communication, reviews, and collaboration.
- Strong interpersonal skills and ability to work effectively across teams.
- Familiarity with modern software engineering processes and delivery practices.
- Self-driven and comfortable operating with a high degree of autonomy and ambiguity.
Bonus Points
- Experience with containerized and cloud-native systems (Docker, Kubernetes, AWS).
- Familiarity with observability tooling and platforms (e.g., Grafana stack).
- Experience working with Python, Go, JavaScript and/or Jsonnet.
- Experience building or operating event-driven or asynchronous systems.
- Experience defining or applying SLIs/SLOs, error budgets, or other reliability metrics.
- Interest in or experience with building testing frameworks or developer tooling.
Compensation & Benefits
- In the Republic of Ireland, the base compensation range for this role is EUR 117,600 - EUR 141,120. Actual compensation may vary by level, experience, and skillset as assessed in the interview process.
- Benefits include equity, a bonus (if applicable), 30 days of annual leave (3 of which are reserved for Grafana Shutdown Days), in-person onboarding, and other company benefits referenced by Grafana Labs.
Additional Details
- Remote-first company with a 100% remote culture. This posting notes interest in applicants in the Ireland time zone and lists Republic of Ireland (Remote) as the location for the compensation reference.
- The team values open-source roots, transparency, autonomy, and collaboration. Grafana Labs also provides company-funded access to AI coding assistants and frontier models for developer productivity within security guidelines.