Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Go @ 4
Grafana @ 4
Kubernetes @ 7
Linux @ 4
Terraform @ 3
Python @ 4
GCP @ 4
Java @ 4
Hiring @ 4
Leadership @ 7
AWS @ 4
Azure @ 4
Helm @ 3
Mentoring @ 7
Networking @ 4
SRE @ 4
Technical Leadership @ 7
Design Patterns @ 4
Observability @ 4
AI @ 4
Details
Grafana Labs is hiring a Staff Software Engineer (SRE) to support Grafana Cloud’s database products (Mimir, Loki, Tempo, Pyroscope) delivered as SaaS from AWS, GCP, and Azure across regions. This role is embedded within the Mimir, Loki, and Tempo squads and focuses on increasing reliability for high-SLA customers. This is a remote opportunity targeting candidates located in the United Kingdom, Sweden, Spain, or Germany.
Responsibilities
- Partner closely with product engineering squads (embedded model)
- Own production reliability for high-SLA and complex customer environments
- Design and implement automation to scale reliability practices
- Ensure customers meet SLO targets and define/evolve per-tenant SLOs and reliability models
- Proactively reduce SLO burn to prevent repeat incidents
- Serve as a primary escalation point and participate in on-call rotations for relevant incidents
- Lead customer-impacting incident response and post-incident reviews (PIRs)
- Contribute to design docs and code reviews
- Influence feature design to ensure production scalability and operability
- Build automation to eliminate toil and improve alert quality, reducing noisy escalations
Requirements
- 8+ years of engineering experience, with 4+ years in SRE/CRE/production engineering (strong preference for formal customer reliability engineering experience)
- Strong Kubernetes experience in AWS, GCP, or Azure
- Familiarity with infrastructure-as-code tooling such as Helm, Terraform, Jsonnet
- Experience operating multi-tenant systems in production
- Strong experience designing and implementing SLOs
- Experience with one or more programming languages (for example Go, Python, or Java)
- Knowledge of Linux operating system internals, networking, cloud storage, and scaling
- Excellent problem-solving and troubleshooting skills
- Experience in calm, blame-free Incident Response, follow-up actions, and writing high-quality Post Incident Reviews (PIRs)
- Ability to reason about performance, scaling, and failure modes
- Strong technical leadership: leading projects, mentoring engineers, and serving as a force-multiplier
- Ability to partner deeply with product engineering teams and work autonomously
Your day-to-day
- Regular 1:1s with your manager and colleagues
- Review and create SLOs, investigate and reduce budget burn via monitoring, automation, self-healing, auto-scaling, etc.
- Improve observability of customers within their environments
- Design and implement solutions to ensure reliability and scalability
- Develop fault-tolerant design patterns and consider reliability during the service lifecycle
- Collaborate with Engineering Leaders on product strategy, roadmaps, and technical designs
- Participate in PR review and design doc collaboration
- Teach SRE best practices and participate in incident response including Bridge calls when necessary
Compensation & Benefits
- Germany base compensation range: EUR 109,709 - EUR 131,651 (actual compensation may vary by level, experience, and assessed skillset)
- Benefits include equity, bonus (if applicable), and other benefits detailed on Grafana Labs careers pages
- 100% remote company with in-person onboarding
- Global annual leave policy of 30 days (3 of which are reserved for Grafana Shutdown Days). Local legislation applies where relevant.
Why You’ll Thrive at Grafana Labs
- 100% remote, global culture
- Scaling organization with meaningful work and transparency
- Open-source roots and empowered teams
- Career growth pathways and approachable leadership
- Culture valuing curiosity, transparency, bias toward action, and kindness
Other notes
- This role is open to candidates located in the United Kingdom, Sweden, Spain, or Germany.
- Compensation ranges are country-specific; other country applicants will receive market-specific pay range information during the hiring process.
- Grafana Labs may utilize AI tools in recruitment; manual review of CVs is still performed.