Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Go @ 4
Grafana @ 4
Kubernetes @ 3
Linux @ 4
Terraform @ 3
Python @ 4
GCP @ 4
Java @ 4
Hiring @ 4
Leadership @ 7
AWS @ 4
Azure @ 4
Communication @ 4
Helm @ 3
Mentoring @ 7
Networking @ 4
SRE @ 4
Technical Leadership @ 7
Design Patterns @ 4
Observability @ 4
AI @ 4
Details
Grafana Labs is hiring a Staff Software Engineer - SRE to support high-value Grafana Cloud customers by improving the reliability of Grafana Cloud databases based on Mimir, Loki, Tempo, and Pyroscope. These databases are delivered as a SaaS product across AWS, GCP, and Azure in all regions. This is a remote opportunity; Grafana Labs is specifically looking for candidates from the United Kingdom, Sweden, Spain, or Germany.
Responsibilities
- Partner closely with product engineering squads (embedded model)
- Own production reliability for high-SLA and complex customer environments
- Design and implement automation to scale reliability practices
- Ensure SLO targets are met for customers, and define/evolve per-tenant SLOs and reliability models
- Proactively reduce SLO burn to prevent repeat incidents
- Serve as a primary escalation point and participate in on-call for relevant incidents
- Lead customer-impacting incident response and post-incident reviews (PIRs/post-mortems)
- Contribute to design docs and code reviews
- Influence feature design to ensure production scalability and operability
- Build automation to eliminate toil and improve alert quality to reduce noisy escalations
Requirements
- 8+ years of engineering experience, with 4+ years in SRE/CRE/production engineering (strong preference for formal customer reliability engineering experience)
- Strong Kubernetes experience in AWS, GCP, or Azure; familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.)
- Experience operating multi-tenant systems in production
- Strong experience designing and implementing SLOs
- Experience with one or more programming languages (examples: Go, Python, Java)
- Experience with Linux operating system internals, and knowledge of networking, cloud storage, and scaling
- Excellent problem-solving and troubleshooting skills
- Experience participating in blame-free incident response, following up on actions, and writing high-quality post-incident reviews (PIRs)
- Ability to reason about performance, scaling, and failure modes
- Strong technical leadership: leading projects, mentoring engineers, and serving as a force-multiplier
- Comfortable partnering deeply with product engineering teams and working autonomously
Day-to-day
- Regular 1:1s with manager and colleagues
- Review and create SLOs; investigate and implement ways to reduce budget burn through monitoring, automation, self-healing, auto-scaling, etc.
- Improve observability for customers within their environments
- Design and implement solutions for reliability and scalability to meet growing demand
- Develop fault-tolerant design patterns and consider reliability across the service lifecycle
- Collaborate with engineering leaders on product strategy, roadmaps, and technical designs
- Participate in PR review and design doc collaboration
- Teach SRE best practices and participate in incident response and customer communication when necessary
Compensation
- In Spain, the base compensation range for this role is €94,025 - €112,830. Actual compensation may vary based on level, experience, and skillset as assessed in the interview process. Benefits may include equity, bonus (if applicable), and other benefits listed by Grafana Labs.
Why You’ll Thrive at Grafana Labs
- 100% remote, global culture with a high-trust, low-ego environment
- Scaling organization, transparent communication, and innovation-driven teams
- Open source roots and empowered teams with career growth pathways
- In-person onboarding and an annual leave policy (30 days per annum, with 3 reserved Grafana Shutdown Days)
Equal Opportunity
Grafana Labs is an equal opportunities employer and welcomes applications from everyone. The company may utilize AI tools in the recruitment process to assist with CV matching; recruitment will continue to be reviewed manually by the team.