Training Performance Engineer
Required Skills & Competences
Python (6), Distributed Systems (3), TensorFlow (3), Communication (3), Rust (6), Debugging (3), PyTorch (3), CUDA (6), GPU (3)
Training Runtime designs the core distributed machine-learning training runtime that powers everything from early research experiments to frontier-scale model runs. With a dual mandate to accelerate researchers and enable frontier scale, the team builds a unified, modular runtime focused on:
- High-performance, asynchronous, zero-copy tensor and optimizer-state-aware data movement.
- Performant, high-uptime, fault-tolerant training frameworks: training loop, state management, resilient checkpointing, deterministic orchestration, and observability.
- Distributed process management for long-lived, job-specific, and user-provided processes.
This role is based in San Francisco, CA. The team uses a hybrid work model (three days in the office per week) and offers relocation assistance to new employees.
Responsibilities
- Profile end-to-end training runs to identify performance bottlenecks across compute, communication, and storage.
- Optimize GPU utilization and throughput for large-scale distributed model training.
- Analyze GPU kernel performance and collective communication throughput; investigate I/O bottlenecks.
- Collaborate with runtime and systems engineers to improve kernel efficiency, scheduling, and collective communication performance.
- Implement model graph transforms and model sharding to improve end-to-end throughput.
- Build tooling to monitor and visualize MFU (Model FLOPs Utilization), throughput, and uptime across clusters (see the sketch after this list).
- Partner with researchers to ensure new model architectures scale efficiently during pre-training.
- Contribute to infrastructure and reliability decisions for large training jobs.
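As a point of reference for the MFU monitoring mentioned above, here is a minimal sketch of how MFU is commonly estimated: achieved model FLOPs per second divided by the cluster's theoretical peak. The function name, the example numbers, and the ~6-FLOPs-per-parameter-per-token approximation (forward + backward pass of a dense transformer) are illustrative assumptions, not the team's actual tooling or methodology.

```python
def mfu(tokens_per_second: float, n_params: float, n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOPs / theoretical peak FLOPs.

    Assumes ~6 FLOPs per parameter per training token (forward + backward)
    for a dense transformer; a rough but widely used approximation.
    """
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    peak_flops_per_second = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second

# Illustrative example: a 70B-parameter model at 1.0e6 tokens/s on 1024 GPUs,
# each with a nominal 989 TFLOP/s BF16 peak.
print(f"MFU: {mfu(1.0e6, 70e9, 1024, 989e12):.1%}")  # ~41.5%
```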
Requirements
- Strong programming skills in Python and C++; Rust or CUDA is a plus.
- Experience running distributed training jobs on multi-GPU systems or HPC clusters.
- Experience debugging complex distributed systems and measuring efficiency rigorously.
- Exposure to frameworks such as PyTorch, JAX, or TensorFlow and understanding of large-scale training loop construction.
- Familiarity with communication libraries such as NCCL, MPI, or UCX is a plus (see the sketch after this list).
- Experience or exposure to large-scale data loading and checkpointing systems.
- Prior work on training runtimes, distributed scheduling, or ML compiler optimization is a plus.
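To illustrate the kind of multi-GPU collective communication these requirements refer to, here is a minimal sketch assuming PyTorch with the NCCL backend and a torchrun launch; the script name and tensor contents are illustrative only.

```python
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; all_reduce sums it across all ranks.
    # This is the same collective pattern that underlies gradient
    # synchronization in data-parallel training.
    x = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: sum = {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=8 all_reduce_demo.py` on a single multi-GPU node.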
Benefits
- Competitive base pay (range published separately) and equity.
- Medical, dental, and vision insurance with employer HSA contributions.
- Pre-tax accounts (Health FSA, Dependent Care FSA, commuter benefits).
- 401(k) with employer match.
- Paid parental, medical, and caregiver leave; flexible PTO.
- 13+ paid company holidays and coordinated office closures.
- Mental health and wellness support; employer-paid basic life and disability coverage.
- Annual learning and development stipend.
- Daily meals in offices and meal delivery credits as eligible.
- Relocation support for eligible employees.
About OpenAI
OpenAI is an AI research and deployment company focused on building and safely deploying general-purpose AI. The company values diverse perspectives and is an equal opportunity employer. Background checks are administered in accordance with applicable law. Reasonable accommodations for applicants with disabilities are available.