Used Tools & Technologies
Go
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Grafana @ 3
Kubernetes @ 2
Terraform @ 2
TypeScript @ 3
GCP @ 2
Distributed Systems @ 2
AWS @ 2
Azure @ 2
Communication @ 3
Helm @ 2
Node.js @ 3
AI @ 3
Details
Grafana Labs is a remote-first, open-source company whose software is used by millions worldwide. The Platform Foundations group builds internal engineering platform services and the systems that support Grafana Cloud, Enterprise, and grafana.com integrations. AppCore focuses on core business and customer workflows such as billing, provisioning, marketplace integrations, and the user portal.
Responsibilities
- Design, build, and operate reconciliation systems and control-plane services that manage Grafana Cloud stacks at scale (Stack State Service - SSS).
- Track desired stack state, detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration.
- Collaborate across SSS, grafana.com, and deployment configurations to ensure reliable, observable, and resilient stack lifecycle workflows.
- Improve operational efficiency by reducing deployment complexity and contributing to the Stack Config Reconciliation project.
- Manage rollout mechanisms for provisioned plugins, dashboards, data sources, Grafana versions, release channels, and stack-level configuration.
- Support new region and cluster rollouts and the operational paths required to bring stacks online safely in new Grafana Cloud regions.
- Improve incident response and recovery for stack misalignment, reconciliation failures, plugin rollout issues, and Hosted Grafana integration failures.
- Partner with Product, Hosted Grafana, Infrastructure, Support, and adjacent AppCore squads on customer-impacting stack lifecycle work.
- Contribute to roadmap planning, technical design, OnCall improvements, runbooks, dashboards, alerts, reconciliation safety, rollout controls, and recovery procedures.
- Own production behavior of systems you build; debug across service boundaries and make careful changes affecting customer stacks.
Requirements
- At least 1 year of fully remote work experience.
- Some experience working on a SaaS platform and familiarity with distributed systems concepts (scalability, multi-tenancy, HA).
- Professional experience with Golang and willingness to work across backend service and application code.
- Care about developer and user experience and product quality.
- Experience contributing to project delivery from brainstorming to shipping.
- Write clean, well-tested, maintainable software.
- Ability to take well-defined tasks, break them down, and iterate to deliver working solutions and gather feedback.
- Willingness to collaborate across teams and align work with other squads and stakeholders.
- Familiarity with Kubernetes in AWS, GCP, or Azure, and exposure to infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
- Experience participating in blameless incident response and contributing to post-incident reviews.
Bonus Points
- Experience with TypeScript/Node.js.
- Experience with Kubernetes control-plane patterns, operators, reconcilers, or desired-state systems.
- Experience with Jsonnet/Tanka, Terraform, Flux, Argo, or similar deployment/configuration tooling.
- Experience working on SaaS provisioning, tenancy, regional expansion, plugin rollout, or customer lifecycle systems.
- Experience with incident response involving configuration drift, partial failure, or cross-service state mismatch.
Compensation & Rewards
- In Spain, the base compensation range for this role is EUR 65,000 - EUR 81,000 per year.
- All roles include Restricted Stock Units (RSUs).
- Compensation ranges are country-specific; other markets may have different pay ranges, which recruiters can discuss.
Why You’ll Thrive at Grafana Labs
- Remote-first, global culture with 100% remote roles.
- Opportunity to do high-impact work in a scaling organization with transparent communication and open-source roots.
- In-person onboarding and 30 days annual leave (subject to local legislation).
Other Notes
- Grafana Labs embraces AI-assisted development practices and may utilize AI tools in recruitment to match CVs to job postings.
- This role is available for candidates located in the United Kingdom, Germany, Spain, Ireland, and Sweden.
- Equal Opportunity Employer statement and privacy policy link included in the original posting.