Training Performance Engineer

at OpenAI
USD 250,000-460,000 per year
Mid-level
✅ Hybrid
✅ Relocation

Used Tools & Technologies

Not specified

Required Skills & Competences

Python (6) · Distributed Systems (3) · TensorFlow (3) · Communication (3) · Rust (6) · Debugging (3) · PyTorch (3) · CUDA (6) · GPU (3)

Details

Training Runtime designs the core distributed machine-learning training runtime that powers everything from early research experiments to frontier-scale model runs. With a dual mandate to accelerate researchers and enable frontier scale, the team builds a unified, modular runtime focused on:

  • High-performance, asynchronous, zero-copy, tensor- and optimizer-state-aware data movement.
  • Performant, high-uptime, fault-tolerant training frameworks: the training loop, state management, resilient checkpointing (sketched below), deterministic orchestration, and observability.
  • Distributed process management for long-lived, job-specific, and user-provided processes.
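As a rough illustration of what resilient checkpointing in a training loop involves, here is a minimal single-GPU PyTorch sketch; the toy model, checkpoint path, and save cadence are hypothetical stand-ins, not the team's actual sharded, asynchronous machinery:

    import os
    import torch
    import torch.nn as nn

    CKPT_PATH = "checkpoint.pt"  # hypothetical path

    model = nn.Linear(512, 512)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    start_step = 0

    # Resume: restore model, optimizer, and step counter if a checkpoint
    # exists, so a preempted or crashed job continues instead of restarting.
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH)
        model.load_state_dict(ckpt["model"])
        opt.load_state_dict(ckpt["optimizer"])
        start_step = ckpt["step"] + 1

    for step in range(start_step, 1000):
        x = torch.randn(32, 512)       # stand-in for a real data loader
        loss = model(x).pow(2).mean()  # toy objective
        opt.zero_grad()
        loss.backward()
        opt.step()

        # Checkpoint periodically; write to a temp file and rename atomically
        # so a crash mid-write never corrupts the last good checkpoint.
        if step % 100 == 0:
            tmp = CKPT_PATH + ".tmp"
            torch.save({"model": model.state_dict(),
                        "optimizer": opt.state_dict(),
                        "step": step}, tmp)
            os.replace(tmp, CKPT_PATH)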

This role is based in San Francisco, CA. The team uses a hybrid work model (three days in the office per week) and offers relocation assistance to new employees.

Responsibilities

  • Profile end-to-end training runs to identify performance bottlenecks across compute, communication, and storage.
  • Optimize GPU utilization and throughput for large-scale distributed model training.
  • Analyze GPU kernel performance and collective communication throughput; investigate I/O bottlenecks.
  • Collaborate with runtime and systems engineers to improve kernel efficiency, scheduling, and collective communication performance.
  • Implement model graph transforms and model sharding to improve end-to-end throughput.
  • Build tooling to monitor and visualize MFU (Model FLOPs Utilization), throughput, and uptime across clusters (see the sketch after this list).
  • Partner with researchers to ensure new model architectures scale efficiently during pre-training.
  • Contribute to infrastructure and reliability decisions for large training jobs.
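For context on the MFU metric above: Model FLOPs Utilization is the model FLOPs a run actually achieves divided by the hardware's theoretical peak. A minimal sketch using the common 6N-FLOPs-per-token approximation for dense transformer training; every number below is a hypothetical example, not a measured figure:

    def mfu(n_params: float, tokens_per_sec: float,
            num_gpus: int, peak_flops_per_gpu: float) -> float:
        """Fraction of theoretical peak FLOPs a training run achieves."""
        achieved = 6 * n_params * tokens_per_sec  # fwd + bwd ~ 6N FLOPs/token
        return achieved / (num_gpus * peak_flops_per_gpu)

    # Hypothetical: a 70B-parameter model at 400k tokens/s on 1,024 GPUs,
    # each rated at 989 TFLOP/s dense BF16.
    print(f"MFU: {mfu(70e9, 4e5, 1024, 989e12):.1%}")  # -> MFU: 16.6%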

Requirements

  • Strong programming skills in Python and C++; Rust or CUDA experience is a plus.
  • Experience running distributed training jobs on multi-GPU systems or HPC clusters.
  • Experience debugging complex distributed systems and measuring efficiency rigorously.
  • Exposure to frameworks such as PyTorch, JAX, or TensorFlow and understanding of large-scale training loop construction.
  • Familiarity with communication libraries such as NCCL, MPI, or UCX is a plus (see the measurement sketch after this list).
  • Experience with, or exposure to, large-scale data loading and checkpointing systems.
  • Prior work on training runtimes, distributed scheduling, or ML compiler optimization is a plus.
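As one example of measuring collective-communication throughput rigorously, here is a minimal torch.distributed/NCCL all-reduce timing sketch; the launch command, buffer size, and iteration counts are illustrative assumptions, not a prescribed benchmark:

    import os
    import time
    import torch
    import torch.distributed as dist

    # Launch (illustrative): torchrun --nproc_per_node=8 allreduce_bench.py
    dist.init_process_group("nccl")  # rank/world size come from torchrun env vars
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    x = torch.randn(256 * 1024 * 1024 // 4, device="cuda")  # 256 MiB of fp32

    for _ in range(5):               # warm-up: exclude one-time setup costs
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()         # collectives run async; wait before stopping the clock
    elapsed = (time.perf_counter() - start) / iters

    if dist.get_rank() == 0:
        n = dist.get_world_size()
        # Ring all-reduce moves 2*(n-1)/n of the buffer per GPU ("bus bandwidth").
        bus_gbps = x.numel() * 4 * 2 * (n - 1) / n / elapsed / 1e9
        print(f"all_reduce: {elapsed * 1e3:.2f} ms/iter, ~{bus_gbps:.1f} GB/s bus bandwidth")
    dist.destroy_process_group()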

Benefits

  • Competitive base pay (range listed above) and equity.
  • Medical, dental, and vision insurance with employer HSA contributions.
  • Pre-tax accounts (Health FSA, Dependent Care FSA, commuter benefits).
  • 401(k) with employer match.
  • Paid parental, medical, and caregiver leave; flexible PTO.
  • 13+ paid company holidays and coordinated office closures.
  • Mental health and wellness support; employer-paid basic life and disability coverage.
  • Annual learning and development stipend.
  • Daily meals in offices and meal delivery credits as eligible.
  • Relocation support for eligible employees.

About OpenAI

OpenAI is an AI research and deployment company focused on building and safely deploying general-purpose AI. The company values diverse perspectives and is an equal opportunity employer. Background checks are administered in accordance with applicable law. Reasonable accommodations for applicants with disabilities are available.