Required Skills & Competences
Go @ 5, Python @ 5, CI/CD @ 6, Distributed Systems @ 3, Hiring @ 3, Bash @ 5, Rust @ 5, Debugging @ 3, PyTorch @ 3
Details
Join the team that builds and operates Groq’s real-time, distributed inference system delivering large-scale inference for LLMs and next-gen AI applications at ultra-low latency. As a Low-Level Production Engineer, your mission is to ensure reliability, fault tolerance, and operational excellence in Groq’s LPU-powered infrastructure. You’ll work deep in the stack—bridging distributed runtime systems with the hardware—to keep Groq systems fast, stable, and production-ready at scale.
Responsibilities
- Production Reliability: Operate and harden Groq’s distributed runtime across thousands of LPUs, ensuring uptime and resilience under dynamic global workloads.
- Low-Level Debugging: Diagnose and resolve hardware-software integration issues in live environments, from data-center-level events to single-component failures.
- Observability & Diagnostics: Build tools and infrastructure to improve real-time system monitoring, fault detection, and SLO tracking.
- Automation & Scale: Automate deployment workflows, failover systems, and operational playbooks to reduce overhead and accelerate reliability improvements.
- Performance & Optimization: Profile and tune production systems for throughput, latency, and determinism—every cycle counts.
- Cross-Functional Collaboration: Partner with compiler, hardware, infra, and data center teams to deliver robust, fault-tolerant production systems.
Requirements / Ideal candidates have/are
- Proven experience in production engineering across the stack and operating large-scale distributed systems.
- Deep knowledge of computer architecture, operating systems, and hardware-software interfaces.
- Skilled in low-level systems programming (C/C++ or Rust), with scripting fluency (Python, Bash, or Go).
- Comfortable debugging complex issues close to the metal—kernels, firmware, or hardware-aware code paths.
- Strong background in automation, CI/CD, and building reliable systems that scale.
- Thrive across environments—from kernel internals to distributed runtimes to data center operations.
- Communicate clearly, make pragmatic decisions, and take ownership of long-term outcomes.
Nice to have
- Experience operating high-performance, real-time systems at scale (ML inference, HPC, or similar).
- Familiarity with GPUs, FPGAs, or ASICs in production environments.
- Prior exposure to ML frameworks (e.g., PyTorch) or compiler tooling (e.g., MLIR).
- Track record of delivering complex production systems in high-impact environments.
Compensation
At Groq, a competitive base salary is part of our comprehensive compensation package, which includes equity and benefits. For this role, the base salary range is $236,360 to $278,070, determined by your location, skills, qualifications, experience, and internal benchmarks. This range is specific to roles in the United States; compensation for candidates outside the United States will depend on the local market.
Equal Opportunity & Accommodations
Groq is an Equal Opportunity Employer and complies with applicable federal, state, and local laws governing nondiscrimination in employment. Groq is committed to providing reasonable accommodations to qualified individuals with disabilities. For accommodation requests related to the application or hiring process, contact [email protected] (contact is for accommodation requests only). All offers of employment are contingent upon verification of identity and employment authorization in accordance with federal law.