Training Performance Engineer
Required Skills & Competences
Python (6), Distributed Systems (3), TensorFlow (3), Communication (3), Rust (6), Debugging (3), PyTorch (3), CUDA (6), GPU (3)
Training Runtime designs the core distributed machine-learning training runtime that powers everything from early research experiments to frontier-scale model runs. With a dual mandate to accelerate researchers and enable frontier scale, the team builds a unified, modular runtime focused on:
- High-performance, asynchronous, zero-copy tensor and optimizer-state-aware data movement.
- Performant, high-uptime, fault-tolerant training frameworks: training loop, state management, resilient checkpointing, deterministic orchestration, and observability.
- Distributed process management for long-lived, job-specific, and user-provided processes.
This role is based in San Francisco, CA. The team uses a hybrid work model (three days in the office per week) and offers relocation assistance to new employees.
Responsibilities
- Profile end-to-end training runs to identify performance bottlenecks across compute, communication, and storage.
- Optimize GPU utilization and throughput for large-scale distributed model training.
- Analyze GPU kernel performance and collective communication throughput; investigate I/O bottlenecks.
- Collaborate with runtime and systems engineers to improve kernel efficiency, scheduling, and collective communication performance.
- Implement model graph transforms and model sharding to improve end-to-end throughput.
- Build tooling to monitor and visualize MFU (Model FLOPs Utilization), throughput, and uptime across clusters (see the sketch after this list).
- Partner with researchers to ensure new model architectures scale efficiently during pre-training.
- Contribute to infrastructure and reliability decisions for large training jobs.
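As a point of reference for the MFU monitoring mentioned above, here is a minimal sketch of how MFU is commonly estimated: achieved model FLOPs per second divided by the cluster's theoretical peak. The function name, the example numbers, and the ~6-FLOPs-per-parameter-per-token approximation (forward + backward pass of a dense transformer) are illustrative assumptions, not the team's actual tooling or methodology.

```python
def mfu(tokens_per_second: float, n_params: float, n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOPs / theoretical peak FLOPs.

    Assumes ~6 FLOPs per parameter per training token (forward + backward)
    for a dense transformer; a rough but widely used approximation.
    """
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    peak_flops_per_second = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second

# Illustrative example: a 70B-parameter model at 1.0e6 tokens/s on 1024 GPUs,
# each with a nominal 989 TFLOP/s BF16 peak.
print(f"MFU: {mfu(1.0e6, 70e9, 1024, 989e12):.1%}")  # ~41.5%
```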
Requirements
- Strong programming skills in Python and C++; Rust or CUDA is a plus.
- Experience running distributed training jobs on multi-GPU systems or HPC clusters.
- Experience debugging complex distributed systems and measuring efficiency rigorously.
- Exposure to frameworks such as PyTorch, JAX, or TensorFlow and understanding of large-scale training loop construction.
- Familiarity with communication libraries such as NCCL, MPI, or UCX is a plus (see the sketch after this list).
- Experience or exposure to large-scale data loading and checkpointing systems.
- Prior work on training runtimes, distributed scheduling, or ML compiler optimization is a plus.
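To illustrate the kind of multi-GPU collective communication these requirements refer to, here is a minimal sketch assuming PyTorch with the NCCL backend and a torchrun launch; the script name and tensor contents are illustrative only.

```python
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; all_reduce sums it across all ranks.
    # This is the same collective pattern that underlies gradient
    # synchronization in data-parallel training.
    x = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: sum = {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=8 all_reduce_demo.py` on a single multi-GPU node.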
Benefits
- Competitive base pay (range published separately) and equity.
- Medical, dental, and vision insurance with employer HSA contributions.
- Pre-tax accounts (Health FSA, Dependent Care FSA, commuter benefits).
- 401(k) with employer match.
- Paid parental, medical, and caregiver leave; flexible PTO.
- 13+ paid company holidays and coordinated office closures.
- Mental health and wellness support; employer-paid basic life and disability coverage.
- Annual learning and development stipend.
- Daily meals in offices and meal delivery credits as eligible.
- Relocation support for eligible employees.
About OpenAI
OpenAI is an AI research and deployment company focused on building and safely deploying general-purpose AI. The company values diverse perspectives and is an equal opportunity employer. Background checks are administered in accordance with applicable law. Reasonable accommodations for applicants with disabilities are available.