Staff Software Engineer - Databases SRE

at Grafana Labs

📍 Germany
📍 Spain
📍 United Kingdom
📍 Sweden

GBP 104,000-124,800 per year

SENIOR

✅ Remote

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Go @ 4, Grafana @ 4, Kubernetes @ 7, Linux @ 4, Terraform @ 3, Python @ 4, GCP @ 4, Java @ 4, Hiring @ 4, Leadership @ 7, AWS @ 4, Azure @ 4, Communication @ 7, Helm @ 3, Mentoring @ 7, Networking @ 4, SRE @ 4, Technical Leadership @ 7, Design Patterns @ 4, Observability @ 4

Details

Grafana Labs is hiring a Staff Software Engineer - SRE to support Grafana Cloud’s highest-value customers by increasing the reliability of cloud databases based on Mimir, Loki, Tempo, and Pyroscope. The role is remote. The team provides these databases as a SaaS product on AWS, GCP, and Azure across all regions. The SRE team is embedded within the Mimir, Loki, and Tempo squads and focuses on delivering exceptional reliability for high-SLA customers.

Responsibilities

Partner closely with product engineering squads using an embedded model.
Own production reliability for high-SLA and complex customer environments.
Design and implement automation to scale reliability practices and eliminate toil.
Ensure customers' SLO targets are met; define and evolve per-tenant SLOs and reliability models.
Proactively reduce SLO burn to prevent repeat incidents.
Serve as a primary escalation point and participate in on-call for relevant incidents.
Lead customer-impacting incident response and post-incident reviews (PIRs/post-mortems).
Contribute to design documents and code reviews; influence feature design for scalability and operability.
Improve alert quality and reduce noisy escalations.
Teach and communicate SRE best practices to engineering teams.

Requirements

8+ years of engineering experience, with 4+ years in SRE/CRE/production engineering (strong preference for formal customer reliability engineering experience).
Strong Kubernetes experience in at least one cloud provider (AWS, GCP, or Azure).
Familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
Strong experience with technical leadership: leading projects, mentoring engineers, and acting as a force-multiplier.
Experience operating multi-tenant systems in production and designing/implementing SLOs.
Experience with one or more programming languages (for example Go, Python, or Java).
Knowledge of Linux operating system internals, networking, cloud storage, performance, scaling, and failure modes.
Excellent problem-solving and troubleshooting skills, and the ability to participate calmly and actively in blame-free incident response.
Ability to partner deeply with product engineering teams and work with autonomy.
Strong communication skills, intellectual curiosity, transparency, bias for action, and a collaborative mindset.

Day-to-day

Regular 1:1s with manager and colleagues.
Review and create SLOs; investigate and reduce SLO budget burn through monitoring, automation, self-healing, auto-scaling, etc.
Improve observability for customers within their environments.
Design and implement solutions for reliability and scalability to meet growing demand.
Develop fault-tolerant design patterns and participate in PR reviews and design docs.
Participate in incident response, including investigation, resolution, PIR, and customer communication via bridge calls when needed.

Compensation & Benefits

UK base compensation range: £103,958 - £124,750 (actual compensation may vary by level, experience, and skillset). Benefits include equity, possible bonus, and other benefits listed on the company careers page.
100% remote company with in-person onboarding; global annual leave policy of 30 days (subject to local legislation).