Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and you know the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Go @ 4
Grafana @ 4
Kubernetes @ 7
Linux @ 4
Terraform @ 3
Python @ 4
GCP @ 4
Java @ 4
Hiring @ 4
Leadership @ 7
AWS @ 4
Azure @ 4
Communication @ 7
Helm @ 3
Mentoring @ 7
Networking @ 4
SRE @ 4
Technical Leadership @ 7
Design Patterns @ 4
Observability @ 4
Details
Grafana Labs is hiring a Staff Software Engineer - SRE to support Grafana Cloud’s highest-value customers by increasing the reliability of cloud databases based on Mimir, Loki, Tempo, and Pyroscope. The role is remote and the team provides these databases as a SaaS product from AWS, GCP, and Azure across all regions. The SRE team is embedded within the Mimir, Loki, and Tempo squads and focuses on delivering exceptional reliability for high-SLA customers.
Responsibilities
- Partner closely with product engineering squads using an embedded model.
- Own production reliability for high-SLA and complex customer environments.
- Design and implement automation to scale reliability practices and eliminate toil.
- Ensure customers meet SLO targets; define and evolve per-tenant SLOs and reliability models.
- Proactively reduce SLO burn to prevent repeat incidents.
- Serve as a primary escalation point and participate in on-call for relevant incidents.
- Lead customer-impacting incident response and post-incident reviews (PIRs/post-mortems).
- Contribute to design documents and code reviews; influence feature design for scalability and operability.
- Improve alert quality and reduce noisy escalations.
- Teach and communicate SRE best practices to engineering teams.
Requirements
- 8+ years of engineering experience, with 4+ years in SRE/CRE/production engineering (strong preference for formal customer reliability engineering experience).
- Strong Kubernetes experience in at least one cloud provider (AWS, GCP, or Azure).
- Familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
- Strong experience with technical leadership: leading projects, mentoring engineers, and acting as a force-multiplier.
- Experience operating multi-tenant systems in production and designing/implementing SLOs.
- Experience with one or more programming languages (for example, Go, Python, or Java).
- Knowledge of Linux operating system internals, networking, cloud storage, performance, scaling, and failure modes.
- Excellent problem-solving and troubleshooting skills, and calm, active participation in blame-free incident response.
- Ability to partner deeply with product engineering teams and work with autonomy.
- Strong communication skills, intellectual curiosity, transparency, bias for action, and a collaborative mindset.
Day-to-day
- Regular 1:1s with manager and colleagues.
- Review and create SLOs; investigate and reduce SLO budget burn through monitoring, automation, self-healing, auto-scaling, etc.
- Improve observability of customers within their environments.
- Design and implement solutions for reliability and scalability to meet growing demand.
- Develop fault-tolerant design patterns and participate in PR reviews and design docs.
- Participate in incident response, including investigation, resolution, PIR, and customer communication via bridge calls when needed.
Compensation & Benefits
- UK base compensation range: £103,958 - £124,750 (actual compensation may vary by level, experience, and skillset). Benefits include equity, possible bonus, and other benefits listed on the company careers page.
- 100% remote company with in-person onboarding; global annual leave policy of 30 days per annum (subject to local legislation).