Senior Staff Software Engineer, High Performance GPU Inference Systems
at Groq
USD 248,710-292,600 per year
Required Skills & Competences
Kubernetes @ 4, Python @ 6, Algorithms @ 7, Distributed Systems @ 4, Rust @ 6, PyTorch @ 4, CUDA @ 4, GPU @ 4
Groq delivers fast, efficient AI inference with an LPU-based system powering GroqCloud™. We are on a mission to make high performance AI compute more accessible and affordable. This role focuses on pushing the limits of heterogeneous GPU environments, dynamic global scheduling, and end-to-end system performance while writing code as close to the metal as possible.
Responsibilities
- Design and implement scalable, low-latency runtime systems that coordinate thousands of GPUs across tightly integrated, software-defined infrastructure (distributed systems engineering).
- Build deterministic, hardware-aware abstractions optimized for CUDA, ROCm, or vendor-specific toolchains to ensure ultra-efficient execution, fault isolation, and reliability (low-level GPU optimization).
- Develop profiling, observability, and diagnostics tooling to provide real-time insights into GPU utilization, memory bottlenecks, and latency deviations; continuously improve system SLOs (performance & diagnostics).
- Future-proof the stack to support evolving GPU architectures (e.g., H100, MI300), NVLink/Fabric topologies, and multi-accelerator systems (including FPGAs or custom silicon).
- Collaborate cross-functionally with ML compilers, orchestration, cloud infrastructure, and hardware ops to ensure architectural alignment and unlock joint performance wins.
- Drive automation, testability, and continuous integration practices for large-scale systems.
Requirements
- Proven ability to ship high-performance, production-grade distributed systems and maintain large-scale GPU production deployments.
- Deep knowledge of GPU architecture (memory hierarchies, streams, kernels), OS internals, parallel algorithms, and HW/SW co-design principles.
- Proficiency in systems languages such as C++ (CUDA), Python, or Rust, with fluency writing hardware-aware code.
- Strong experience with, and an obsession for, performance profiling, GPU kernel tuning, memory coalescing, and resource-aware scheduling.
- Comfortable working across stack layers—from GPU drivers and kernels to orchestration layers and inference serving.
- Passion for automation, testability, CI, and tooling to support reliability and performance diagnostics.
Additionally Nice to Have
- Experience operating large-scale GPU inference systems in production (e.g., Triton, TensorRT, or custom GPU services).
- Experience deploying and optimizing ML/HPC workloads on GPU clusters (Kubernetes, Slurm, Ray, etc.).
- Hands-on experience with multi-GPU training/inference frameworks (PyTorch DDP, DeepSpeed, JAX).
- Familiarity with compiler tooling and graph optimization (TVM, MLIR, XLA).
- Experience delivering technically ambitious projects in fast-paced environments.
Attributes of a Groqster
- Humility, collaboration, growth mindset, curiosity, innovation, passion, grit, and ownership.
Benefits & Compensation
- Competitive base salary range for United States roles: $248,710 to $292,600. Compensation also includes equity and benefits; exact pay is determined by location, skills, qualifications, and experience.
- Groq is an Equal Opportunity Employer and is committed to reasonable accommodations for qualified individuals with disabilities. For accommodation requests contact [email protected].
- All offers are contingent upon verification of identity and employment authorization in accordance with federal law.