Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by the "@" symbol and a proficiency level value.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Security @ 4
Go @ 4
Grafana @ 4
Kubernetes @ 7
Linux @ 4
Terraform @ 3
Python @ 4
GCP @ 4
Java @ 4
Leadership @ 4
AWS @ 4
Azure @ 4
Communication @ 7
Helm @ 3
Mentoring @ 4
Networking @ 4
SRE @ 4
Technical Leadership @ 4
Design Patterns @ 4
Observability @ 4
AI @ 4
Details
Grafana Labs is the company behind the open observability cloud and Grafana Cloud, a fully managed observability platform. We are a 100% remote company with a global team and an open-source culture. This role is a remote opportunity; we are looking for candidates from the United Kingdom, Sweden, Spain, Germany, or Ireland.
About the role
You will join the SRE team embedded within the Mimir, Loki, and Tempo squads to increase the reliability of Grafana Cloud database products (Mimir, Loki, Tempo, Pyroscope) delivered as SaaS across AWS, GCP, and Azure. The role focuses on production reliability for high-SLA, multi-tenant, and complex customer environments.
Responsibilities
- Partner closely with product engineering squads (embedded model).
- Own production reliability for high-SLA and complex customer environments.
- Design and implement automation to scale reliability practices and eliminate toil.
- Ensure customers meet SLO targets; define and evolve per-tenant SLOs and reliability models.
- Proactively reduce SLO burn to prevent repeat incidents.
- Serve as a primary escalation point and participate in the on-call rotation for relevant incidents.
- Lead customer-impacting incident response and post-incident reviews (PIRs).
- Contribute to design docs and code reviews; influence feature design for scalability and operability.
- Improve alert quality and reduce noisy escalations.
- Improve observability of customer environments and develop fault-tolerant design patterns.
Requirements
- 8+ years of engineering experience, including 4+ years in SRE/CRE/production engineering; formal customer reliability engineering experience is preferred.
- Strong Kubernetes experience running in AWS, GCP, or Azure.
- Familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
- Strong experience operating multi-tenant systems in production and designing/implementing SLOs.
- Experience with one or more programming languages, such as Go, Python, or Java.
- Experience with Linux operating system internals; some knowledge of networking, cloud storage, and scaling.
- Excellent problem-solving and troubleshooting skills; experience in blame-free incident response and writing high-quality PIRs.
- Technical leadership experience: leading projects, mentoring engineers, and serving as a force-multiplier.
- Comfortable partnering deeply with product engineering teams; strong written and verbal communication for design docs, code review, and incident communication.
Day-to-day
- Regular 1:1s with manager and colleagues.
- Review and create SLOs; investigate and reduce SLO budget burn via monitoring, automation, self-healing, and autoscaling.
- Participate in PR reviews, design docs, and collaborate across teams.
- Participate in incident response from investigation through resolution and customer communication when needed.
Compensation & Benefits
- Ireland base compensation range: €117,600 - €141,120 (actual compensation may vary by level, experience, and skillset).
- Benefits include equity, bonus (if applicable), global annual leave policy (30 days per annum; 3 days reserved for Grafana Shutdown Days), in-person onboarding, and other benefits listed by Grafana Labs.
- Company-funded AI tooling allowance and encouragement of pragmatic AI-assisted development within security guidelines.
Equal Opportunity
Grafana Labs is an equal opportunity employer and welcomes applications from diverse backgrounds. For details on how personal data is used in the application process, see Grafana Labs' privacy policy.