Senior Staff Software Engineer, High Performance GPU Inference Systems

at Groq
USD 248,710-292,600 per year
SENIOR
✅ Remote

Used Tools & Technologies

Not specified

Required Skills & Competencies

Kubernetes (4), Python (6), Algorithms (7), Distributed Systems (4), Rust (6), PyTorch (4), CUDA (4), GPU (4)

Details

Groq delivers fast, efficient AI inference with an LPU-based system powering GroqCloud™. We are on a mission to make high performance AI compute more accessible and affordable. This role focuses on pushing the limits of heterogeneous GPU environments, dynamic global scheduling, and end-to-end system performance while writing code as close to the metal as possible.

Responsibilities

  • Design and implement scalable, low-latency runtime systems that coordinate thousands of GPUs across tightly integrated, software-defined infrastructure (distributed systems engineering).
  • Build deterministic, hardware-aware abstractions optimized for CUDA, ROCm, or vendor-specific toolchains to ensure ultra-efficient execution, fault isolation, and reliability (low-level GPU optimization).
  • Develop profiling, observability, and diagnostics tooling to provide real-time insights into GPU utilization, memory bottlenecks, and latency deviations; continuously improve system SLOs (performance & diagnostics).
  • Future-proof the stack to support evolving GPU architectures (e.g., H100, MI300), NVLink/Fabric topologies, and multi-accelerator systems (including FPGAs or custom silicon).
  • Collaborate cross-functionally with ML compilers, orchestration, cloud infrastructure, and hardware ops to ensure architectural alignment and unlock joint performance wins.
  • Drive automation, testability, and continuous integration practices for large-scale systems.

Requirements

  • Proven ability to ship high-performance, production-grade distributed systems and maintain large-scale GPU production deployments.
  • Deep knowledge of GPU architecture (memory hierarchies, streams, kernels), OS internals, parallel algorithms, and HW/SW co-design principles.
  • Proficiency in systems languages such as C++ (with CUDA) or Rust, as well as Python, with fluency writing hardware-aware code.
  • Strong experience with, and enthusiasm for, performance profiling, GPU kernel tuning, memory coalescing, and resource-aware scheduling.
  • Comfortable working across stack layers, from GPU drivers and kernels to orchestration layers and inference serving.
  • Passion for automation, testability, CI, and tooling to support reliability and performance diagnostics.

Nice to Have

  • Experience operating large-scale GPU inference systems in production (e.g., Triton, TensorRT, or custom GPU services).
  • Experience deploying and optimizing ML/HPC workloads on GPU clusters (Kubernetes, Slurm, Ray, etc.).
  • Hands-on experience with multi-GPU training/inference frameworks (PyTorch DDP, DeepSpeed, JAX).
  • Familiarity with compiler tooling and graph optimization (TVM, MLIR, XLA).
  • Experience delivering technically ambitious projects in fast-paced environments.

Attributes of a Groqster

  • Humility, collaboration, growth mindset, curiosity, innovation, passion, grit, and ownership.

Benefits & Compensation

  • Competitive base salary range for United States-based roles: $248,710 to $292,600. Total compensation includes equity and benefits; exact pay is determined by location, skills, qualifications, and experience.
  • Groq is an Equal Opportunity Employer and is committed to reasonable accommodations for qualified individuals with disabilities. For accommodation requests contact [email protected].
  • All offers contingent upon verification of identity and employment authorization in accordance with federal law.