Staff Backend Engineer - Application Core Services, Stacks

📍 Canada
CAD 186,400-223,600 per year
MIDDLE
✅ Remote

Used Tools & Technologies

Go

Required Skills & Competences

Security @ 3, Grafana @ 3, Kubernetes @ 2, Terraform @ 2, TypeScript @ 3, GCP @ 2, Distributed Systems @ 3, AWS @ 2, Azure @ 2, Helm @ 2, Mentoring @ 3, Node.js @ 3, AI @ 3

Details

Grafana Labs is a remote-first, open-source company whose products are used by millions globally, including large enterprises. The Application Core Services (AppCore) team partners with Cloud, Enterprise, and Grafana teams to deliver reliable internal and customer-facing systems that power critical parts of the Grafana business. AppCore builds on the grafana.com platform to create custom solutions and integrations across the systems that support a modern software company, including billing, provisioning, marketplace integrations, and the customer portal.

This is a remote role for applicants located in Canadian time zones (EST and CST only at this time). The AppCore Stacks squad owns the systems that create, configure, reconcile, migrate, and operate Grafana Cloud stacks at scale. A stack is the customer-facing Grafana Cloud environment that connects an organization to Grafana and its backend services (Mimir, Loki, Tempo), along with plugins, dashboards, data sources, and stack-level configuration.

Responsibilities

  • Design, build, and operate reconciliation systems (including the Stack State Service, SSS) that track desired stack state and detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration.
  • Collaborate across SSS, grafana.com, and deployment configurations to ensure stack lifecycle workflows remain reliable, observable, and resilient.
  • Improve operational efficiency by reducing deployment complexity and contributing to the Stack Config Reconciliation project.
  • Manage rollout mechanisms for provisioned plugins, dashboards, data sources, Grafana versions, release channels, and stack-level configuration.
  • Support new region and cluster rollouts and the operational paths required to bring stacks online safely in new Grafana Cloud regions.
  • Improve incident response and recovery paths for stack misalignment, reconciliation failures, plugin rollout issues, and Hosted Grafana integration failures.
  • Partner with Product, Hosted Grafana, Infrastructure, Support, and adjacent AppCore squads on customer-impacting stack lifecycle work.
  • Contribute to roadmap planning, technical design, OnCall improvements, and long-term simplification of stack operations.
  • Own the production behavior of the systems you build: runbooks, dashboards, alerts, reconciliation safety, rollout controls, and recovery procedures. Debug across service boundaries and make careful changes that affect customer stacks.
  • Participate in an on-call rotation (follow-the-sun coverage; the company aims for ~12 daylight hours/day coverage).
  • Use AI coding assistants as desired within security guidelines; the company provides an AI usage budget and access to frontier models for development productivity.

Requirements

  • At least 1 year of fully remote work experience.
  • Experience working on a large SaaS platform and dealing with distributed systems challenges (scalability, multi-tenancy, data isolation, HA, etc.).
  • Professional experience with Golang and willingness to work across backend service and application code.
  • Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
  • Experience delivering projects end-to-end: gathering requirements, shipping products, and iterating based on feedback.
  • Ability to write clean, robust, well-tested software that others can understand, operate, and maintain.
  • Experience mentoring junior engineers in a collaborative but asynchronous environment.
  • Ability to decompose complex challenges into safe, iterative increments and communicate tradeoffs to stakeholders.
  • Experience participating in blameless incident response and writing high-quality post-incident reviews.

Bonus Points

  • Experience with TypeScript/Node.js.
  • Experience with Kubernetes control-plane patterns, operators, reconcilers, or desired-state systems.
  • Experience with Jsonnet/Tanka, Terraform, Flux, Argo, or similar deployment/configuration tooling.
  • Experience working on SaaS provisioning, tenancy, regional expansion, plugin rollout, or customer lifecycle systems.
  • Experience with incident response involving configuration drift, partial failure, or cross-service state mismatch.

Compensation & Rewards

  • Base compensation range in Canada: CAD 186,368 - CAD 223,642. Actual compensation may vary based on level, experience, and skillset.
  • Roles include Restricted Stock Units (RSUs), potential bonus (if applicable), and other benefits listed by the company.
  • 100% remote company culture, a global annual leave policy of 30 days (3 of which are reserved for Grafana Shutdown Days), and in-person onboarding.

Other Details

  • Remote-first company; role expects collaboration across time zones and ownership of critical systems.
  • Equal opportunity employer. The recruitment process may utilize AI tools to assist in matching CVs to job postings.