Required Skills & Competences
Security, Go, Kubernetes, Linux, Terraform, Python, Hiring, Bash, Git, Networking, GPU
Details
Groq delivers fast, efficient AI inference. Our LPU-based system powers GroqCloud™, giving businesses and developers the speed and scale they need. From our Bay Area roots to our growing global presence, we are on a mission to make high-performance AI compute more accessible and affordable. When real-time AI is within reach, anything is possible. Build fast.
Groq is building a custom cloud from the ground up, one data center at a time. The Compute Storage team owns the systems that turn racks of bare metal into production-ready Kubernetes clusters powering the next generation of AI workloads. We are hiring a Sr. Staff Linux Systems Engineer to help scale this effort by creating a reliable, performant, and secure foundation for GroqCloud.
Responsibilities
- Kernel- and OS-level enablement and optimization for compute nodes (GPU, LPU) and storage clusters.
- Work with infrastructure peers to define optimal health standards for production servers, including certified OS, kernel, and BIOS/firmware versions.
- Strengthen security posture by improving system-level CVE response processes.
- Debug and resolve systems-level performance and reliability issues across the fleet.
- Work with vendors to debug and resolve BIOS/firmware issues.
- Support design and deployment of large GPU clusters and network fabric integrations.
- Lead cross-functional collaboration with data center operations, networking, and platform teams to ensure infrastructure is integrated and production-ready.
- Follow best practices and standards for infrastructure-as-code and configuration management using Git, Flux, Terraform, and related tools.
- Set technical direction and maintain high-quality system documentation, operational runbooks, and internal tooling to improve resilience, repeatability, and observability of the infrastructure stack.
Requirements
- Experience with Linux OS management in large virtualized environments.
- Deep kernel knowledge and experience working with the upstream community to resolve bugs.
- Experience deploying large GPU clusters and working with network fabric.
- Familiarity with infrastructure-as-code and Git-based workflows (e.g., Terraform, Flux, Kustomize).
- Ability to write and maintain basic tooling in Go, Python, or Bash.
- Understanding of networking fundamentals (IPAM, VLANs, DHCP, DNS).
- Working knowledge of storage concepts (block vs object, NFS, RAID, etc.).
- Strong sense of ownership and willingness to dive into hardware, firmware, or low-level provisioning issues.
Nice to Have
- Exposure to Talos Linux.
- Experience maintaining a production Kubernetes environment.
- Hardware SKU definition and lifecycle management.
Compensation & Benefits
- Base salary range (United States): $310,400 to $365,200, determined by location, skills, qualifications, experience, and internal benchmarks. Compensation outside the USA depends on the local market.
- Competitive base salary plus equity and benefits.
About Groq & Hiring Notes
- Groq is an Equal Opportunity Employer committed to inclusion. Reasonable accommodations are available for applicants with disabilities (contact: [email protected] for accommodation requests only).
- All offers contingent upon verification of identity and employment authorization.
- Groq encourages applicants with criminal record histories to apply in accordance with applicable local fair chance hiring laws.