Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — expert. You can teach others, and you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Go @ 4
Grafana @ 4
Kubernetes @ 3
Linux @ 4
Terraform @ 3
Python @ 4
GCP @ 4
Java @ 4
Hiring @ 4
Leadership @ 7
AWS @ 4
Azure @ 4
Helm @ 3
Mentoring @ 7
Networking @ 4
SRE @ 4
Technical Leadership @ 7
Design Patterns @ 4
Observability @ 4
AI @ 4
Details
Grafana Labs is the company behind the open observability cloud and Grafana Cloud, a fully managed observability platform available on AWS, GCP, and Azure. We are a 100% remote company with a global team and an open-source culture. This role is remote, and we are looking for candidates based in the UK, Sweden, Spain, or Germany.
Role summary
We are hiring a Staff Software Engineer - SRE to support high-value Grafana Cloud customers by increasing the reliability of our Cloud databases based on Mimir, Loki, Tempo, and Pyroscope. The SRE team is embedded within the Mimir, Loki, and Tempo squads and focuses on ensuring our database products deliver exceptional reliability for high-SLA customers across AWS, GCP, and Azure.
Responsibilities
- Partner closely with product engineering squads (embedded model).
- Own production reliability for high-SLA and complex customer environments.
- Design and implement automation to scale reliability practices and eliminate toil.
- Ensure SLO targets are met for customers, and define and evolve per-tenant SLOs and reliability models.
- Proactively reduce SLO burn to prevent repeat incidents.
- Serve as a primary escalation point and be on-call for relevant incidents.
- Lead customer-impacting incident response and post-incident reviews (PIRs/post-mortems).
- Contribute to design docs and code reviews, and influence feature design for production scalability and operability.
- Improve alert quality and reduce noisy escalations.
Requirements
- 8+ years of engineering experience, including 4+ years in SRE/CRE/production engineering; strong preference for formal customer reliability engineering (CRE) experience.
- Strong Kubernetes experience in AWS, GCP, or Azure and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
- Strong technical leadership experience: leading projects, mentoring engineers, and acting as a force-multiplier.
- Experience operating multi-tenant systems in production.
- Strong experience designing and implementing SLOs.
- Experience with one or more programming languages (examples: Go, Python, Java).
- Experience with Linux internals; knowledge of networking, cloud storage, and scaling.
- Excellent problem-solving and troubleshooting skills.
- Experience participating in blame-free incident response, following up on action items, and writing high-quality post-incident reviews (PIRs).
- Ability to reason about performance, scaling, and failure modes.
- Comfortable working with autonomy and partnering deeply with product engineering teams.
- Intellectual curiosity, transparency, bias toward action, and interpersonal kindness are highly valued.
Day-to-day
- Regular 1:1s with manager and colleagues.
- Review and create SLOs, and investigate ways to reduce SLO budget burn (monitoring, automation, self-healing, auto-scaling).
- Improve observability of customer environments.
- Design and implement solutions for reliability and scalability to meet growing demand.
- Develop fault-tolerant design patterns and consider reliability throughout the service lifecycle.
- Collaborate with engineering leaders on product strategy, roadmaps, and technical designs.
- Participate in PR review and design doc collaboration.
- Teach SRE best practices and participate in incident response, including bridge calls with customers when necessary.
Technologies & systems mentioned
- Databases / observability components: Mimir, Loki, Tempo, Pyroscope
- Cloud platforms: AWS, GCP, Azure
- Kubernetes
- Infrastructure-as-code: Helm, Terraform, Jsonnet
- Programming languages: Go, Python, Java (examples)
- Linux internals, networking, cloud storage, scaling, monitoring/observability, SLOs, automation
Compensation & benefits
- Sweden base compensation range: SEK 878,578 - SEK 1,054,294 (actual compensation varies by level, experience, and skillset).
- Benefits include equity, bonus (if applicable), primary benefits listed on the company careers page, 30 days annual leave (global policy), in-person onboarding, and other benefits as described by Grafana Labs.
Other notes
- Grafana Labs is an equal opportunity employer and may utilize AI tools in its recruitment process. Compensation ranges are country specific; candidates applying from other locations will discuss their market's defined pay range with a recruiter.