Senior Software Engineer - Grafana Databases, Managed Services
Used Tools & Technologies
PostgreSQL
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and you know the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 4
Grafana @ 4
Kafka @ 4
Kubernetes @ 4
Linux @ 4
Terraform @ 3
GCP @ 3
Distributed Systems @ 4
AWS @ 3
Azure @ 3
Communication @ 4
Helm @ 3
Networking @ 4
SRE @ 7
Cassandra @ 4
Snowflake @ 4
Observability @ 4
AI @ 4
ClickHouse @ 4
Details
Grafana Labs is a remote-first, open-source company with more than 20M users of Grafana and a managed product offering (Grafana Cloud) used by thousands of companies. The Managed Services team within the Databases department owns and operates shared, production-critical infrastructure that powers Grafana Cloud’s database products (Mimir, Loki, Tempo). This includes operating 100+ WarpStream clusters across multiple cloud providers and regions and working with high-volume analytical and storage systems where latency, compression, storage layout, and scaling characteristics matter deeply.
Role overview
This is a Senior Engineer role on the Managed Services team focused on operating and evolving multi-cloud streaming clusters and related database infrastructure. The role blends deep distributed systems work with reliability, scaling, and operational excellence, and includes on-call responsibilities.
Responsibilities
- Operate and evolve 100+ multi-cloud streaming clusters and related database infrastructure.
- Diagnose and eliminate cross-layer failure modes (object storage latency, noisy neighbors, control-plane bottlenecks, query performance regressions, etc.).
- Design safe upgrade and rollout strategies at scale.
- Improve observability, automation, and operational ergonomics.
- Partner closely with database and platform teams to ensure safe scaling, partitioning, consumer fan-out, and query performance.
- Work directly with distributed systems behavior, Kubernetes scheduling dynamics, storage engines, and compression trade-offs.
- Serve as a primary escalation point and participate in on-call rotations and incident response.
- Own vendor relationships (e.g., WarpStream Labs and others).
- Participate in PR review, design documents, automation, tooling, and post-incident reviews.
Requirements
- 6+ years of engineering experience, including meaningful time in SRE, platform engineering, production/infrastructure engineering, or distributed systems roles.
- Experience operating distributed systems in production (examples: Kafka, Redpanda, WarpStream, Postgres, ClickHouse, Snowflake, Cassandra).
- Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
- Solid understanding of distributed systems design and large-scale system trade-offs.
- Proficiency in at least one programming language (Go preferred).
- Working knowledge of Linux internals, networking, cloud storage, and performance/scaling behavior.
- Experience participating in blameless incident response and writing high-quality post-incident reviews.
- Clear communication skills and ability to work autonomously in a remote environment.
Compensation & Rewards
- Base compensation range (United Kingdom): GBP 91,755 - GBP 110,106.
- Roles include Restricted Stock Units (RSUs). Other benefits, equity, and bonus information are available via the company careers page.
Location & work arrangement
- This is a remote role open only to applicants living in UK time zones. Grafana Labs is remote-first and uses video conferencing for collaboration.
- The role includes an on-call component, with coverage coordinated across regions (target: roughly 12 daylight hours of coverage per day).
- In-person onboarding is provided.
Other notes
- Grafana Labs permits use of modern AI coding assistants within security guidelines and provides a company-funded usage budget. Frontier models (examples given) are available as optional tools.
- Grafana Labs is an equal opportunity employer and provides a global remote culture with defined career growth pathways and company-wide transparency.