Staff Software Engineer - Databases SRE

📍 Germany

EUR 109,700-131,700 per year

SENIOR

✅ Remote

Tech Stack
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others, and you know the pitfalls and tricks.

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

AWS @ 7, Azure @ 7, GCP @ 7, Go @ 4, Grafana @ 1, Helm @ 3, Java @ 4, Kubernetes @ 7, Linux @ 4, Networking @ 4, Python @ 4, SRE @ 7, Terraform @ 3

Details

Grafana Labs is looking for a Staff Software Engineer - SRE to support Grafana Cloud’s highest-value customers by increasing the reliability of their Cloud databases, which are based on Mimir, Loki, Tempo, and Pyroscope.

This is a remote opportunity (open to candidates from the UK, Sweden, Spain, or Germany).

Responsibilities

Partner closely with product engineering squads (embedded model)
Own production reliability for high-SLA and complex customer environments
Design and implement automation to scale reliability practices
Ensure customers meet their SLO targets
Define and evolve per-tenant SLOs and reliability models
Proactively reduce SLO burn to prevent repeat incidents
Serve as a primary escalation point and on-call for relevant incidents
Lead customer-impacting incident response and post-incident reviews
Contribute to design docs and code reviews
Influence feature design to ensure production scalability and operability
Build automation to eliminate toil where needed
Improve alert quality and reduce noisy escalations

Requirements

8+ years of engineering experience, including 4+ years in SRE/CRE/production engineering
Strong preference for candidates with formal customer reliability engineering experience
Strong Kubernetes experience in AWS, GCP, or Azure
Familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.)
Experience operating multi-tenant systems in production
Strong experience designing and implementing SLOs
Experience with one or more programming languages (e.g., Go, Python, Java)
Experience with Linux operating system internals and some knowledge of networking, cloud storage, and scaling
Excellent problem-solving and troubleshooting skills
Experience participating in blame-free Incident Response, following up on actions, and writing high-quality PIRs (Post Incident Reviews)
Ability to reason about performance, scaling, and failure modes
Comfortable working with autonomy and self-direction within an engineering team
Ability to partner deeply with product engineering teams
Intellectual curiosity, a default to transparency, a strong bias towards action, and kindness

Benefits

Equity and bonus (if applicable), plus other benefits
100% remote, global culture
Open-source roots and transparent communication
In-person onboarding
Global annual leave policy of 30 days per year, 3 of which are reserved as Grafana Shutdown Days