Staff Software Engineer - Databases SRE

📍 Germany
📍 Spain
📍 United Kingdom
📍 Ireland
📍 Sweden
EUR 117,600-141,100 per year
SENIOR
✅ Remote

Required Skills & Competences

Security @ 4, Go @ 4, Grafana @ 4, Kubernetes @ 7, Linux @ 4, Terraform @ 3, Python @ 4, GCP @ 4, Java @ 4, Leadership @ 4, AWS @ 4, Azure @ 4, Communication @ 7, Helm @ 3, Mentoring @ 4, Networking @ 4, SRE @ 4, Technical Leadership @ 4, Design Patterns @ 4, Observability @ 4, AI @ 4

Details

Grafana Labs is the company behind the open observability cloud and Grafana Cloud, a fully managed observability platform. We are a 100% remote company with a global team and an open-source culture. This role is a remote opportunity; we are looking for candidates from the United Kingdom, Sweden, Spain, Germany, or Ireland.

About the role

You will join the SRE team embedded within the Mimir, Loki, and Tempo squads to increase the reliability of Grafana Cloud database products (Mimir, Loki, Tempo, Pyroscope) delivered as SaaS across AWS, GCP, and Azure. The role focuses on production reliability for high-SLA, multi-tenant, and complex customer environments.

Responsibilities

  • Partner closely with product engineering squads (embedded model).
  • Own production reliability for high-SLA and complex customer environments.
  • Design and implement automation to scale reliability practices and eliminate toil.
  • Ensure services meet their SLO targets for customers; define and evolve per-tenant SLOs and reliability models.
  • Proactively reduce SLO burn to prevent repeat incidents.
  • Serve as a primary escalation point and participate in on-call for relevant incidents.
  • Lead customer-impacting incident response and post-incident reviews (PIRs).
  • Contribute to design docs and code reviews; influence feature design for scalability and operability.
  • Improve alert quality and reduce noisy escalations.
  • Improve observability of customer environments and develop fault-tolerant design patterns.

Requirements

  • 8+ years of engineering experience, including 4+ years in SRE/CRE/production engineering; formal customer reliability engineering experience is preferred.
  • Strong Kubernetes experience running in AWS, GCP, or Azure.
  • Familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
  • Strong experience operating multi-tenant systems in production and designing/implementing SLOs.
  • Experience with one or more programming languages (for example, Go, Python, or Java).
  • Experience with Linux operating system internals; some knowledge of networking, cloud storage, and scaling.
  • Excellent problem-solving and troubleshooting skills; experience in blame-free incident response and writing high-quality PIRs.
  • Technical leadership experience: leading projects, mentoring engineers, and serving as a force-multiplier.
  • Comfortable partnering deeply with product engineering teams; strong written and verbal communication for design docs, code review, and incident communication.

Day-to-day

  • Regular 1:1s with manager and colleagues.
  • Review and create SLOs; investigate and reduce SLO error-budget burn via monitoring, automation, self-healing, and autoscaling.
  • Participate in PR reviews, contribute to design docs, and collaborate across teams.
  • Participate in incident response from investigation through resolution, including customer communication when needed.

Compensation & Benefits

  • Ireland base compensation range: €117,600 - €141,120 (actual compensation may vary by level, experience, and skillset).
  • Benefits include equity, bonus (if applicable), global annual leave policy (30 days per annum; 3 days reserved for Grafana Shutdown Days), in-person onboarding, and other benefits listed by Grafana Labs.
  • Company-funded AI tooling allowance and encouragement of pragmatic AI-assisted development within security guidelines.

Equal Opportunity

Grafana Labs is an equal opportunity employer and welcomes applications from diverse backgrounds. For details on how personal data is used in the application process, see Grafana Labs' privacy policy.