Software Engineer, Data Infrastructure - Research

at OpenAI

📍 San Francisco, United States

USD 250,000-380,000 per year

MIDDLE

✅ On-site

✅ Relocation


Used Tools & Technologies

Not specified

Required Skills & Competences

Distributed Systems @ 3
Hiring @ 3
Debugging @ 3
API @ 3
LLM @ 3
GPU @ 3

Details

The Workload team designs and runs OpenAI's LLM training and inference infrastructure, which powers frontier models at massive scale. The team unifies how researchers train and serve models, abstracting away the complexity of performance, parallelism, and execution across large GPU/accelerator fleets. This role focuses on designing and implementing the dataset infrastructure for OpenAI's next-generation training stack: building standardized dataset interfaces, scaling pipelines across thousands of GPUs, and proactively testing for performance bottlenecks. You will collaborate closely with multimodal researchers and other infra groups to ensure datasets are unified, efficient, and easy to consume.

Responsibilities

Design and maintain standardized dataset APIs, including for multimodal (MM) data that cannot fit in memory.
Build proactive testing and scale validation pipelines for dataset loading at GPU scale.
Integrate datasets seamlessly into training and inference pipelines to ensure smooth adoption and good user experience.
Document and maintain dataset interfaces so they are discoverable, consistent, and easy for other teams to adopt.
Establish safeguards and validation systems to ensure datasets remain reproducible and unchanged once standardized.
Debug and resolve performance bottlenecks in distributed dataset loading (e.g., straggler systems slowing global training).
Provide visualization and inspection tools to surface errors, bugs, or bottlenecks in datasets.

Requirements

Strong engineering fundamentals with experience in distributed systems, data pipelines, or infrastructure.
Experience building APIs, modular code, and scalable abstractions, with attention to the user experience of those abstractions.
Comfortable debugging bottlenecks across large fleets of machines.
Pride in building infrastructure that is reliable and scalable; ability to own foundational parts of the ML stack.
Collaborative and humble mindset; ability to work closely with researchers and other infra teams.

Bonus (nice to have)

Background knowledge in data math, probability, or distributed data theory.
Experience with GPU-scale distributed systems or dataset scaling for real-time data.

About OpenAI

OpenAI is an AI research and deployment company focused on ensuring general-purpose AI benefits all of humanity. The company stresses safety, diverse perspectives, and inclusive hiring practices. Background checks will be administered in accordance with applicable law for US-based candidates and other jurisdictions as required. OpenAI is an equal opportunity employer and provides reasonable accommodations to applicants with disabilities.

Benefits

The base pay range is listed for the role; total compensation may also include equity and performance-related bonuses.
Medical, dental, and vision insurance with employer contributions to Health Savings Accounts.
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses.
401(k) retirement plan with employer match.
Paid parental, medical, and caregiver leave; flexible PTO; paid company holidays and office closures.
Mental health and wellness support; employer-paid basic life and disability coverage.
Annual learning and development stipend.
Daily meals in offices and meal delivery credits as eligible.
Relocation support for eligible employees.