Staff Backend Engineer - Application Core Services, Stacks
Used Tools & Technologies
Go
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Security @ 3
Grafana @ 3
Kubernetes @ 2
Terraform @ 2
TypeScript @ 3
GCP @ 2
Distributed Systems @ 3
AWS @ 2
Azure @ 2
Helm @ 2
Mentoring @ 3
Node.js @ 3
AI @ 3
Details
Grafana Labs is a remote-first, open-source company whose products are used by millions globally, including large enterprises. The Application Core Services (AppCore) team partners with Cloud, Enterprise, and Grafana teams to deliver reliable internal and customer-facing systems that power critical parts of the Grafana business. AppCore builds on the grafana.com platform to create custom solutions and integrations across the systems that support a modern software company, including billing, provisioning, marketplace integrations, and the customer portal.
This role is a remote opportunity for applicants located in Canadian time zones (EST + CST only at this time). The AppCore Stacks squad owns the systems that create, configure, reconcile, migrate, and operate Grafana Cloud stacks at scale. A stack is the customer-facing Grafana Cloud environment that connects an organization to Grafana and backend services (Mimir, Loki, Tempo), plugins, dashboards, data sources, and stack-level configuration.
Responsibilities
- Design, build, and operate reconciliation systems (including the Stack State Service) that track desired stack state and detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration.
- Collaborate across the Stack State Service (SSS), grafana.com, and deployment configurations to ensure stack lifecycle workflows remain reliable, observable, and resilient.
- Improve operational efficiency by reducing deployment complexity and contributing to the Stack Config Reconciliation project.
- Manage rollout mechanisms for provisioned plugins, dashboards, data sources, Grafana versions, release channels, and stack-level configuration.
- Support new region and cluster rollouts and the operational paths required to bring stacks online safely in new Grafana Cloud regions.
- Improve incident response and recovery paths for stack misalignment, reconciliation failures, plugin rollout issues, and Hosted Grafana integration failures.
- Partner with Product, Hosted Grafana, Infrastructure, Support, and adjacent AppCore squads on customer-impacting stack lifecycle work.
- Contribute to roadmap planning, technical design, OnCall improvements, and long-term simplification of stack operations.
- Own production behavior of built systems: runbooks, dashboards, alerts, reconciliation safety, rollout controls, and recovery procedures. Debug across service boundaries and make careful changes that affect customer stacks.
- Participate in an on-call rotation (follow-the-sun coverage; company aims for ~12 daylight hours/day coverage).
- Use AI coding assistants as desired within security guidelines; company provides an AI usage budget and access to frontier models for development productivity.
Requirements
- At least 1 year of fully remote work experience.
- Experience working on a large SaaS platform and dealing with distributed systems challenges (scalability, multi-tenancy, data isolation, HA, etc.).
- Professional experience with Golang and willingness to work across backend service and application code.
- Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
- Experience delivering projects end-to-end: gathering requirements, shipping products, and iterating based on feedback.
- Ability to write clean, robust, well-tested software that others can understand, operate, and maintain.
- Experience mentoring junior engineers in a collaborative but asynchronous environment.
- Ability to decompose complex challenges into safe, iterative increments and communicate tradeoffs to stakeholders.
- Experience participating in blameless incident response and writing high-quality post-incident reviews.
Bonus Points
- Experience with TypeScript/Node.js.
- Experience with Kubernetes control-plane patterns, operators, reconcilers, or desired-state systems.
- Experience with Jsonnet/Tanka, Terraform, Flux, Argo, or similar deployment/configuration tooling.
- Experience working on SaaS provisioning, tenancy, regional expansion, plugin rollout, or customer lifecycle systems.
- Experience with incident response involving configuration drift, partial failure, or cross-service state mismatch.
Compensation & Rewards
- Base compensation range in Canada: CAD 186,368 - CAD 223,642. Actual compensation may vary based on level, experience, and skillset.
- Roles include Restricted Stock Units (RSUs), potential bonus (if applicable), and other benefits listed by the company.
- 100% remote company culture, a global annual leave policy of 30 days (3 of which are reserved for Grafana Shutdown Days), and in-person onboarding.
Other Details
- Remote-first company; role expects collaboration across time zones and ownership of critical systems.
- Equal opportunity employer. The recruitment process may utilize AI tools to assist in matching CVs to job postings.