Senior Cloud Infrastructure Engineer

📍 Switzerland
📍 Germany
📍 Spain
📍 France
📍 United Kingdom
📍 Netherlands
📍 Zurich, Switzerland
📍 Munich, Germany
📍 Berlin, Germany
📍 Paris, France
📍 London, United Kingdom

EUR 90,000-160,000 per year

SENIOR

✅ Hybrid

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Security @ 4, Docker @ 4, Kubernetes @ 4, Redis @ 4, IaC @ 4, Terraform @ 4, TypeScript @ 4, CI/CD @ 4, Datadog @ 4, Hiring @ 4, AWS @ 4, Helm @ 4, Networking @ 4, PostgreSQL @ 4, SRE @ 7, Next.js @ 3, CloudFormation @ 4, Experimentation @ 4, LLM @ 4, Compliance @ 4, Observability @ 4, AI @ 4, ClickHouse @ 4

Details

Langfuse is an open-source LLM engineering platform focused on tracing, evaluation, and prompt management. The company is now part of ClickHouse and runs both a cloud and a self-hosted offering trusted by large customers. The team is engineering-heavy, has offices in Berlin and San Francisco, and is hiring engineers in EU timezones. This is a hybrid role that expects approximately one week per month in the Berlin office.

Responsibilities

Own Langfuse Cloud operations: run production environments on AWS (ECS / Fargate) and ClickHouse Cloud, manage deployments, autoscaling, capacity planning, and cost optimization.
Build and maintain world-class observability: own Datadog setup end-to-end (dashboards, alerts, SLOs) and ensure early detection of degradations.
Make self-hosting effortless: own and evolve Helm charts, Docker Compose configuration, and deployment documentation for single-node to multi-region enterprise deployments.
Automate everything: implement and improve CI/CD pipelines, infrastructure-as-code, automated scaling, and zero-downtime deployments.
Scale for future product directions: design infrastructure to handle 10x growth and new features (long-running agents, real-time evaluation).
Harden security and compliance for cloud and self-hosted deployments as enterprise adoption grows.

Requirements

Strong infrastructure or SRE experience running systems at scale and improving reliability.
Experience operating production workloads on AWS (ECS/Fargate, networking, IAM, S3) or comparable hyperscale vendors.
Comfortable with container orchestration (Kubernetes and/or ECS), Helm charts, and Docker.
Experience with infrastructure-as-code (Terraform, Pulumi, CloudFormation, or similar).
Strong monitoring and observability instincts; experience building dashboards and alerts that catch real problems (Datadog experience is a plus).
Strong opinions and discipline around reliability, automation, and safe infrastructure change processes.
Interest in open source and willingness to help users debug self-hosted deployments.
Comfortable working in a small, accountable team where individual output is visible.
CS or quantitative degree preferred.

Bonus (nice-to-have)

Experience with ClickHouse Cloud or other managed analytical databases.
Background operating high-throughput event processing or observability infrastructure.
Contributions to open source infrastructure tooling (Helm charts, Terraform modules, etc.).
Former founder.

Tech Stack & Tools

Cloud / infra: AWS (ECS / Fargate), ClickHouse Cloud, S3, IAM, networking
Orchestration & packaging: Kubernetes, Helm charts, Docker, Docker Compose
IaC & automation: Terraform, Pulumi, CloudFormation, CI/CD pipelines
Observability: Datadog (dashboards, alerts, SLOs)
Data & storage: ClickHouse (tracing), PostgreSQL, Redis
Application stack: TypeScript monorepo (Next.js frontend, Express workers); familiarity is helpful when collaborating with product teams

Process & Culture

Fast hiring process (the company states that the full process, through to an offer, can take less than 7 days).
Emphasis on ownership: engineers propose solutions (RFCs) and ship them; code reviews are used for mentorship.
Maker schedule with limited recurring meetings (weekly check-in and Friday demo). The team encourages experimentation with AI tooling and collaboration across the org.