Performance Engineer - GPU

USD 315,000-560,000 per year
Mid-level
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

  • Algorithms @ 3
  • Distributed Systems @ 3
  • Communication @ 3
  • PyTorch @ 2
  • CUDA @ 3
  • GPU @ 3

Details

Pioneering the next generation of AI requires breakthrough innovations in GPU performance and systems engineering. As a GPU Performance Engineer, you'll architect and implement the foundational systems that power Claude and push the frontiers of what's possible with large language models. You'll be responsible for maximizing GPU utilization and performance at unprecedented scale, developing cutting-edge optimizations that directly enable new model capabilities and dramatically improve inference efficiency.

Working at the intersection of hardware and software, you'll implement state-of-the-art techniques from custom kernel development to distributed system architectures. Your work will span the entire stack, from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.

Strong candidates will have a track record of delivering transformative GPU performance improvements in production ML systems and will be excited to shape the future of AI infrastructure alongside world-class researchers and engineers.

Responsibilities

  • Maximize GPU utilization and end-to-end performance for training and inference of large language models.
  • Implement custom kernels and low-level optimizations (tensor core optimization, kernel fusion, memory bandwidth optimization); a brief fusion sketch follows this list.
  • Design and implement distributed communication strategies for multi-node GPU clusters (NCCL, NVLink, collective communication).
  • Profile production serving and training infrastructure to identify and eliminate bottlenecks, using Nsight and other profilers.
  • Develop performance modeling frameworks to predict and optimize GPU utilization.
  • Co-design attention mechanisms and algorithms for next-generation hardware architectures.
  • Partner with hardware vendors to influence future accelerator capabilities and software stacks.
  • Build resilient, fault-tolerant systems and orchestration for large-scale training clusters.
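
To make the kernel-fusion item above concrete, here is a minimal, hypothetical sketch of the kind of optimization involved: torch.compile can fuse an elementwise chain into a single generated GPU kernel. The function, shapes, and dtypes are illustrative assumptions, not code from this role.

```python
# A minimal kernel-fusion sketch (assumes PyTorch 2.x with a CUDA build).
# bias_gelu is an illustrative toy, not production code.
import torch

def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Run eagerly, this launches separate kernels for the add and the GELU;
    # torch.compile's Inductor backend can fuse the elementwise chain into
    # one generated kernel, removing intermediate memory round-trips.
    return torch.nn.functional.gelu(x + bias)

compiled_bias_gelu = torch.compile(bias_gelu)

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, device="cuda", dtype=torch.float16)
    y = compiled_bias_gelu(x, b)  # first call compiles; later calls reuse the kernel
```

Fusing memory-bound elementwise ops this way trades several kernel launches and intermediate tensors for a single pass over the data, which is exactly the kind of memory-bandwidth win this role targets.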

Requirements

  • At least a Bachelor's degree in a related field or equivalent experience.
  • Deep experience with GPU programming and optimization at scale.
  • Experience with GPU kernel development and related tools such as CUDA, Triton, CUTLASS, Flash Attention, and tensor core optimizations.
  • Familiarity with ML compilers and frameworks (PyTorch/JAX internals, torch.compile, XLA) and custom operators.
  • Strong background in performance engineering: kernel fusion, memory bandwidth optimization, profiling (e.g., Nsight).
  • Experience in distributed systems and multi-GPU training: NCCL, NVLink, collective communication, and model parallelism; see the all-reduce sketch after this list.
  • Knowledge of low-precision and quantization techniques (INT8/FP8, mixed-precision).
  • Experience with production systems: large-scale training infrastructure, cluster orchestration, and fault tolerance.
  • Demonstrated ability to navigate complex systems from hardware interfaces to high-level ML frameworks and to collaborate with researchers and engineers.
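
As a hedged illustration of the multi-GPU requirement, the sketch below averages a tensor across ranks using torch.distributed's NCCL backend, the core primitive behind data-parallel gradient synchronization. The file name, tensor size, and launch command are assumptions made for the example.

```python
# A minimal NCCL all-reduce sketch (illustrative only).
# Launch with: torchrun --nproc_per_node=<num_gpus> allreduce_demo.py
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")  # NCCL handles the GPU collectives
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Each rank contributes a different "gradient"; all_reduce sums in place.
    grad = torch.full((1024,), float(rank + 1), device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= dist.get_world_size()  # average, as in data-parallel training

    if rank == 0:
        print(f"averaged value: {grad[0].item():.2f}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```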

Benefits & Compensation

  • Annual base salary range: $315,000 - $560,000 USD.
  • Competitive compensation package including equity and benefits; total compensation may also include incentive compensation.
  • Generous vacation and parental leave, flexible working hours, and a collaborative office environment.

Logistics

  • Locations: San Francisco, CA; New York City, NY; Seattle, WA (United States).
  • Location-based hybrid policy: staff are expected to be in one of the offices at least 25% of the time.
  • Visa sponsorship: Anthropic does sponsor visas where possible and retains immigration counsel.
  • Deadline to apply: None (applications reviewed on a rolling basis).

How We're Different

Anthropic focuses on large-scale, high-impact AI research as a cohesive team. We value collaboration, communication skills, and the societal impacts of our work. We encourage applicants from diverse backgrounds and those who may not meet every qualification to apply.