Staff Software Engineer - Databases SRE

📍 United Kingdom

GBP 104,000-124,800 per year

SENIOR

✅ Remote

Tech Stack
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.

AI, AWS @ 3, Azure @ 3, Codex, Design Patterns, GCP @ 3, Go @ 4, Grafana @ 9, Helm @ 3, Java @ 4, Kubernetes @ 3, Leadership @ 7, Linux @ 4, Mentoring @ 7, Networking @ 4, Observability, Python @ 4, SRE @ 9, Security, Technical Leadership @ 7, Terraform @ 3

Details

We are looking for a Staff Software Engineer - SRE to help us support our highest-value Grafana Cloud customers by increasing the reliability of our Cloud databases, which are based on Mimir, Loki, Tempo, and Pyroscope.

Grafana Cloud provides these databases as a SaaS product from AWS, GCP, and Azure across all regions.

The SRE team is embedded within the Mimir, Loki, and Tempo squads and focuses on ensuring that Grafana Cloud’s database products deliver exceptional reliability for our highest-SLA customers.

In this role, you will:

Responsibilities

Partner closely with product engineering squads (embedded model)
Own production reliability for high-SLA and complex customer environments
Design and implement automation to scale our reliability practices
Ensure we meet our SLO targets for our customers
Define and evolve per-tenant SLOs and reliability models
Proactively reduce SLO burn to prevent repeat incidents
Serve as a primary escalation point and on-call for relevant incidents
Lead customer-impacting incident response and post-incident reviews
Contribute to design docs and code reviews
Influence feature design to ensure production scalability and operability
Build automation to eliminate toil where needed
Improve alert quality and reduce noisy escalations

Requirements

8+ years engineering experience, 4+ in SRE/CRE/production engineering; strong preference for formal customer reliability engineering experience
Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.)
Strong experience with technical leadership, leading a team through projects, mentoring other engineers on the team and serving as a force-multiplier
Experience operating multi-tenant systems in production
Strong experience designing and implementing SLOs
Experience with one or more programming languages (e.g. Go, Python, Java)
Experience with Linux operating system internals, and some knowledge of networking, cloud storage, and scaling
Excellent problem-solving and troubleshooting skills
Experience participating calmly and actively in blame-free Incident Response, following up on actions, and writing high-quality PIRs (Post Incident Reviews, a.k.a. post-mortem documents)
Ability to reason about performance, scaling, and failure modes
Comfortable working within an engineering team where individuals are encouraged to have a strong sense of autonomy and self-direction
Ability to partner deeply with product engineering teams
Intellectually curious; defaults to transparency; high bias towards action; kind

Your day-to-day will include

Regular 1:1s with your manager and colleagues
Reviewing and creating SLOs, proactively investigating ways in which we can further reduce budget burn for those SLOs (including improvements to monitoring, automation, increasing self-healing, auto-scaling, etc.)
Improving observability of customers within their environments
Designing and implementing solutions to ensure reliability and scalability of our environments can meet rapidly increasing demands
Developing fault-tolerant design patterns that ensure reliability at all stages of the service lifecycle
Collaborating with Engineering Leaders to help define and influence product strategy, roadmaps and technical designs
Participating in PR review and collaborating on Design Docs
Teaching others about Site Reliability Engineering and communicating best practices for new features/functionality
Participating in Incident Response when applicable, including investigation through to resolution, PIR, and communication with customers via Bridge calls where necessary

Grafana Labs is a 100% remote company.

In the UK, the base compensation range for this role is £103,958 - £124,750. Actual compensation may vary based on level, experience, and skillset as assessed in the interview process.

Benefits include equity, bonus (if applicable) and other benefits listed here: https://grafana.com/about/careers/#jobs.

We also encourage the use of modern AI coding assistants as part of your daily workflow (your choice of tools, within security guidelines), backed by a company-funded usage budget. You’ll also have access to frontier models (e.g., GPT-Codex 5/3, Claude Opus 4.6, Gemini 3 Pro).