Used Tools & Technologies
Go
Required Skills & Competences
Each tag name is followed by the "@" symbol and a proficiency level value.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Grafana @ 3
Kubernetes @ 2
Terraform @ 2
TypeScript @ 3
GCP @ 2
Distributed Systems @ 2
AWS @ 2
Azure @ 2
Helm @ 2
Node.js @ 3
Observability @ 3
AI @ 3
Details
Grafana Labs is a remote-first, open-source company building observability software used globally. The AppCore group within Platform (Foundations) builds essential systems driving Grafana's business operations, including billing, provisioning, marketplace integrations, and the user portal. The AppCore Stacks squad specifically owns the systems that create, configure, reconcile, migrate, and operate Grafana Cloud stacks at scale.
Responsibilities
- Design, build, and operate reconciliation systems (including the Stack State Service, SSS) to track desired stack state and detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and customer stack configuration
- Collaborate across SSS, grafana.com, and deployment configurations to keep stack lifecycle workflows reliable, observable, and resilient
- Improve operational efficiency and reduce deployment complexity (e.g., single-PR regional SSS deployment)
- Manage rollout mechanisms for plugins, dashboards, data sources, Grafana versions, release channels, and stack-level configuration
- Support new region and cluster rollouts and the operational paths required to bring stacks online safely
- Improve incident response and recovery for stack misalignment, reconciliation failures, plugin rollout issues, and Hosted Grafana integration failures
- Partner with Product, Hosted Grafana, Infrastructure, Support, and adjacent AppCore squads on customer-impacting work
- Contribute to roadmap planning, technical design, OnCall improvements, runbooks, dashboards, alerts, reconciliation safety, rollout controls, and recovery procedures
- Debug across service boundaries and make careful changes to systems that affect customer stacks; participate in follow-the-sun OnCall when ready
Requirements
- At least 1 year of fully remote work experience
- Some experience working on a SaaS platform and familiarity with distributed systems concepts (scalability, multi-tenancy, HA)
- Professional experience with Golang and willingness to work across backend service and application code
- Care about developer and user experience and product quality; write clean, well-tested, maintainable software
- Experience contributing to delivery of projects from brainstorming to shipping
- Ability to break tasks down and execute iteratively; collaborate across teams and align work with other squads
- Familiarity with Kubernetes in AWS, GCP, or Azure and exposure to infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.)
- Experience participating in blameless incident response and post-incident reviews
Bonus Points
- Experience with TypeScript/Node.js
- Experience with Kubernetes control-plane patterns, operators, reconcilers, or desired-state systems
- Experience with Jsonnet/Tanka, Terraform, Flux, Argo, or similar deployment/configuration tooling
- Experience working on SaaS provisioning, tenancy, regional expansion, plugin rollout, or customer lifecycle systems
- Experience with incident response involving configuration drift, partial failure, or cross-service state mismatch
Compensation & Rewards
- United Kingdom compensation range: GBP 72,000 - GBP 90,000 (country-specific ranges apply; actual compensation varies by level, experience, and skillset)
- All roles include RSUs and a global annual leave policy (30 days per annum, with 3 Grafana Shutdown Days)
- 100% remote company culture; in-person onboarding
Additional Information
- This role is available for candidates located in the United Kingdom, Germany, Spain, Ireland, and Sweden
- Grafana Labs embraces AI-assisted development practices and encourages thoughtful use of AI tools in engineering workflows
- Grafana Labs is an equal opportunity employer; a privacy policy is provided for applicants