Technical Lead Manager - Training Runtime, Data(Set) Movement
Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and you know the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Python @ 4
Distributed Systems @ 4
Machine Learning @ 4
Rust @ 4
Debugging @ 4
API @ 4
Compliance @ 4
AI @ 4
Reinforcement Learning @ 4
Data Pipelines @ 4
Details
About the Team
Training Runtime builds the distributed systems that power OpenAI's largest model training runs. The Data Movement area owns the infrastructure that keeps training jobs supplied with the right data at the right time, and keeps model state moving safely and efficiently across large clusters. Work spans machine learning systems, distributed storage, high-throughput data loading, reliability engineering, and developer experience.
About the Role
We are looking for a deeply hands-on Technical Lead Manager to own datasets throughout our training infrastructure. This person will set the direction for how training jobs read data: the APIs, storage contracts, versioning model, benchmarks, debugging tools, and reliability guarantees that make data access consistent across current and future training frameworks. You will begin as the primary technical owner for dataset reads, working directly in the code while aligning researchers, training framework owners, storage teams, and infrastructure partners around a durable platform.
The problem includes making enormous, heterogeneous datasets easy to consume, fast to restart, correct across distributed workers, observable when something goes wrong, and flexible enough to support pretraining, reinforcement learning, and multimodal training.
Responsibilities
- Design and build a unified dataset read platform for multiple current and future training frameworks.
- Define dataset APIs, storage-format expectations, registration/versioning, and migration paths that make data access reproducible and maintainable.
- Build reliability into the read path, including stateful iteration, caching, fast restart, recovery, and clear operational contracts.
- Build terminal and web-based visualizers that let teams inspect text, multimodal, and reinforcement learning data late in the pipeline.
- Write and review production code in core data loading, service, caching, and reliability paths.
- Partner with teams working on training frameworks, reinforcement learning, multimodal models, storage, runtime, and cluster infrastructure.
Over Time
The long-term goal is a team that owns fast, correct, scalable, and reliable in-cluster data movement for training: data that comes in, data that goes out, and data that moves around inside the cluster. After ramping on datasets, this role will expand to Technical Lead Manager ownership for broader data movement systems, including checkpoint loads/saves and snapshot transfers, while partnering closely with existing technical leads and adjacent infrastructure teams.
Requirements / Qualifications
- Experience building or owning dataset, data loading, storage, or distributed training infrastructure at large scale (for example, torch.utils.data).
- Strong attention to API design, debugging ergonomics, performance, and bit-level correctness.
- Understanding of failure modes of large distributed training jobs and how data systems can create or prevent them.
- Experience with stateful iterators, checkpoint/restart semantics, caching, remote services, or high-throughput storage reads.
- Comfortable working across Python and lower-level systems code; Rust or C++ experience is useful but not required.
- Experience with multimodal, video, reinforcement learning, or pretraining data pipelines where small data bugs are expensive and hard to diagnose.
- Ability to lead through code and technical judgment before a team exists, and to later manage engineers while remaining hands-on.
- Focus on developer experience by eliminating friction (e.g., reducing manual preprocessing scripts and niche cluster-specific bugs).
About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. The company values diversity and inclusion and provides information on applicant privacy, accommodations, and compliance resources.
Benefits
- Medical, dental, and vision insurance with employer contributions to Health Savings Accounts.
- Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses.
- 401(k) retirement plan with employer match.
- Paid parental leave and paid medical/caregiver leave.
- Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees.
- 13+ paid company holidays and additional company office closures.
- Mental health and wellness support; employer-paid basic life and disability coverage.
- Annual learning and development stipend.
- Daily meals in offices and meal delivery credits as eligible.
- Relocation support for eligible employees.
- Additional taxable fringe benefits such as charitable donation matching and wellness stipends.