Senior Observability Engineer

at Groq
USD 215,000-278,100 per year
SENIOR
✅ Remote

Required Skills & Competences

Grafana @ 4, Kubernetes @ 4, Prometheus @ 4, Terraform @ 7, TypeScript @ 4, Communication @ 4, Networking @ 4, IaaS @ 7, Rust @ 4, Debugging @ 4, OpenTelemetry @ 4

Details

Groq delivers fast, efficient AI inference. Our LPU-based system powers GroqCloud™, giving businesses and developers the speed and scale they need. From our Bay Area roots to our growing global presence, we are on a mission to make high-performance AI compute more accessible and affordable.

Mission

Ensure the reliability, scalability, and performance of Groq’s observability tools and services for provisioning and managing the full lifecycle of Groq hardware, software, and networking systems at massive scale.

The Team

The observability team builds the monitoring and observability infrastructure and tooling that supports Groq’s inferencing hardware at massive scale, both in the cloud and in our own datacenters. Our mission is to enable the reliability, scalability, and performance of Groq’s tools and services across a wide range of signals and appliances, to support teams in production excellence, and to evangelize technical and cultural best practices around alerting, instrumentation, debugging, SLOs, and on-call.

Responsibilities

  • Build and maintain comprehensive observability systems at massive scale with strong uptime and reliability.
  • Iterate on, maintain, update, automate, and dogfood your own systems; create monitoring best practices for the organization.
  • Instrument Kubernetes clusters, applications, and datacenter infrastructure components such as switches, PDUs, environmental sensors, cameras, and chillers.
  • Work with all signal types: logs (effective canonical logging and cost control); traces (context propagation, tail sampling strategies, attribute enrichment, querying); and metrics from hosts, kube-state-metrics, kubelet, IPMI, and SNMP.
  • Advise and teach teams on instrumenting applications in a variety of languages (Rust, C++, TypeScript, Go), implementing sensible SLO and alerting strategies, and following on-call best practices.
  • Continuously learn and expand knowledge across domains from networking to FPGA design as needed.

Requirements / Ideal Candidate

  • 4+ years of experience in observability as a core responsibility of previous roles.
  • Deep understanding of cloud-native technologies, infrastructure as a service (IaaS), and infrastructure-as-code and GitOps tooling such as Terraform and Flux.
  • Experience instrumenting large Kubernetes clusters and building operators.
  • Expertise in standing up and running monitoring, observability, and alerting systems: OpenTelemetry tracing and Collector, Grafana/Prometheus, PagerDuty, Alertmanager, IPMI, SNMP, etc.
  • Strong analytical and problem-solving skills focused on root cause analysis and mitigation.
  • Excellent communication and teamwork skills; ability to collaborate effectively across engineering teams.

Attributes of a Groqster

  • Humility – Egos are checked at the door
  • Collaborative & Team Savvy – We make up the smartest person in the room, together
  • Growth & Giver Mindset – Learn it all versus know it all, we share knowledge generously
  • Curious & Innovative – Take a creative approach to projects, problems, and design
  • Passion, Grit, & Boldness – No-limit thinking, fueling informed risk taking

Compensation

Base salary range (United States): $214,952 to $278,070. Base salary is part of a comprehensive compensation package including equity and benefits. Compensation for candidates outside the USA will depend on the local market.

Equal Opportunity & Accommodations

Groq is an Equal Opportunity Employer committed to an inclusive environment. Reasonable accommodations are available for applicants with disabilities; contact [email protected] for accommodation requests.