Staff Software Engineer - Databases SRE

📍 Ireland

EUR 117,600-141,100 per year

SENIOR

✅ Remote

Tech Stack
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others, and you know all the pitfalls and tricks.

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

AI, AWS @ 7, Azure @ 7, GCP @ 7, Go @ 4, Grafana, Helm @ 3, Java @ 4, Kubernetes @ 7, Leadership @ 7, Linux @ 4, Mentoring @ 7, Networking @ 4, Python @ 4, SRE @ 7, Security, Technical Leadership @ 7, Terraform @ 3

Details

Grafana Labs is looking for a Staff Software Engineer - SRE to support Grafana Cloud’s highest-value customers by increasing the reliability of the Cloud databases based on Mimir, Loki, Tempo, and Pyroscope.

This is a remote opportunity. The company is 100% remote and provides these databases as a SaaS product from AWS, GCP, and Azure across all regions.

Responsibilities

Partner closely with product engineering squads (embedded model)
Own production reliability for high-SLA and complex customer environments
Design and implement automation to scale reliability practices
Ensure customers meet their SLO targets
Define and evolve per-tenant SLOs and reliability models
Proactively reduce SLO burn to prevent repeat incidents
Serve as a primary escalation point and on-call for relevant incidents
Lead customer-impacting incident response and post-incident reviews
Contribute to design docs and code reviews
Influence feature design to ensure production scalability and operability
Build automation to eliminate toil where needed
Improve alert quality and reduce noisy escalations

You’ll also help invest in developer productivity, including use of modern AI coding assistants as part of daily workflow (within security guidelines), with a company-funded usage budget and access to frontier models.

Requirements

8+ years of engineering experience, including 4+ in SRE/CRE/production engineering (formal customer reliability engineering experience preferred)
Strong Kubernetes experience in AWS, GCP, or Azure
Familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.)
Strong experience with technical leadership (leading a team, mentoring, force-multiplier)
Experience operating multi-tenant systems in production
Strong experience designing and implementing SLOs
Experience with one or more programming languages (Go, Python, Java, etc.)
Experience with Linux operating system internals and some knowledge of networking, cloud storage, and scaling
Excellent problem-solving and troubleshooting skills
Experience participating in blame-free incident response, following up on actions, and writing high-quality PIRs (post-incident reviews)
Ability to reason about performance, scaling, and failure modes
Comfortable working with autonomy and self-direction in an engineering team
Ability to partner deeply with product engineering teams
Intellectual curiosity, default to transparency, high bias towards action, and kindness

Benefits

Equity
Bonus (if applicable)
Other benefits listed at https://grafana.com/about/careers/#jobs
Global annual leave policy of 30 days per annum, with 3 days reserved for Grafana Shutdown Days
In-person onboarding