Staff Software Engineer - Databases SRE

📍 Germany
📍 Spain
📍 United Kingdom
📍 Sweden
GBP 104,000-124,800 per year
SENIOR
✅ Remote

Used Tools & Technologies

Not specified

Required Skills & Competences

Go @ 4, Grafana @ 4, Kubernetes @ 7, Linux @ 4, Terraform @ 3, Python @ 4, GCP @ 4, Java @ 4, Hiring @ 4, Leadership @ 7, AWS @ 4, Azure @ 4, Communication @ 7, Helm @ 3, Mentoring @ 7, Networking @ 4, SRE @ 4, Technical Leadership @ 7, Design Patterns @ 4, Observability @ 4

Details

Grafana Labs is hiring a Staff Software Engineer - SRE to support Grafana Cloud’s highest-value customers by increasing the reliability of cloud databases based on Mimir, Loki, Tempo, and Pyroscope. The role is remote. The team provides these databases as a SaaS product from AWS, GCP, and Azure across all regions. The SRE team is embedded within the Mimir, Loki, and Tempo squads and focuses on delivering exceptional reliability for high-SLA customers.

Responsibilities

  • Partner closely with product engineering squads using an embedded model.
  • Own production reliability for high-SLA and complex customer environments.
  • Design and implement automation to scale reliability practices and eliminate toil.
  • Ensure customers meet SLO targets; define and evolve per-tenant SLOs and reliability models.
  • Proactively reduce SLO burn to prevent repeat incidents.
  • Serve as a primary escalation point and participate in on-call rotations for relevant incidents.
  • Lead customer-impacting incident response and post-incident reviews (PIRs/post-mortems).
  • Contribute to design documents and code reviews; influence feature design for scalability and operability.
  • Improve alert quality and reduce noisy escalations.
  • Teach and communicate SRE best practices to engineering teams.

Requirements

  • 8+ years of engineering experience, with 4+ years in SRE/CRE/production engineering (strong preference for formal customer reliability engineering experience).
  • Strong Kubernetes experience in at least one cloud provider (AWS, GCP, or Azure).
  • Familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
  • Strong experience with technical leadership: leading projects, mentoring engineers, and acting as a force-multiplier.
  • Experience operating multi-tenant systems in production and designing/implementing SLOs.
  • Experience with one or more programming languages (for example, Go, Python, or Java).
  • Knowledge of Linux operating system internals, networking, cloud storage, performance, scaling, and failure modes.
  • Excellent problem-solving and troubleshooting skills, with calm, active participation in blame-free incident response.
  • Ability to partner deeply with product engineering teams and work with autonomy.
  • Strong communication skills, intellectual curiosity, transparency, bias for action, and a collaborative mindset.

Day-to-day

  • Regular 1:1s with manager and colleagues.
  • Review and create SLOs; investigate and reduce SLO budget burn through monitoring, automation, self-healing, auto-scaling, etc.
  • Improve observability of customers within their environments.
  • Design and implement solutions for reliability and scalability to meet growing demand.
  • Develop fault-tolerant design patterns and participate in PR reviews and design docs.
  • Participate in incident response, including investigation, resolution, PIR, and customer communication via bridge calls when needed.

Compensation & Benefits

  • UK base compensation range: £103,958 - £124,750 (actual compensation may vary by level, experience, and skillset). Benefits include equity, possible bonus, and other benefits listed on the company careers page.
  • 100% remote company with in-person onboarding; global annual leave policy of 30 days per annum (subject to local legislation).