Training: ML Framework Engineer
Used Tools & Technologies
Not specified
Required Skills & Competences
Python @ 6, Distributed Systems @ 3, Machine Learning @ 3, Performance Optimization @ 3
About the Team
Training Runtime designs the core distributed machine-learning training runtime that powers everything from early research experiments to frontier-scale model runs. With a dual mandate to accelerate researchers and enable frontier scale, the team builds a unified, modular runtime that meets researchers where they are and moves with them up the scaling curve.
The team focuses on three pillars: high-performance, asynchronous, zero-copy tensor and optimizer-state-aware data movement; performant, high-uptime, fault-tolerant training frameworks (training loop, state management, resilient checkpointing, deterministic orchestration, and observability); and distributed process management for long-lived, job-specific, and user-provided processes. The team integrates proven large-scale capabilities into a composable, developer-facing runtime so teams can iterate quickly and run reliably at any scale. Success is measured by raising both training throughput (how fast models train) and researcher throughput (how fast ideas become experiments and products).
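One of the pillars above is resilient checkpointing inside a fault-tolerant training loop. As a rough illustration of the concept only (not the team's actual framework), here is a minimal PyTorch sketch; the checkpoint path, save interval, and toy model are all assumptions:

```python
import os

import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"  # hypothetical path, for illustration only

model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_step = 0

# Resume from the last checkpoint if one exists, so a restarted job
# continues where it left off instead of recomputing from scratch.
if os.path.exists(CKPT_PATH):
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, 100):
    x = torch.randn(32, 8)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 10 == 0:
        # Write to a temp file and rename: os.replace is atomic, so a
        # crash mid-write never corrupts the last good checkpoint.
        tmp = CKPT_PATH + ".tmp"
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step},
            tmp,
        )
        os.replace(tmp, CKPT_PATH)
```

A production runtime layers much more on top (sharded optimizer state, asynchronous writes, multi-rank coordination), but the resume-or-start-fresh and atomic-write pattern is the core idea.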
About the Role
As a Training: ML Framework Engineer, you will improve training throughput in the internal training framework while enabling researchers to experiment with new ideas. The work requires strong engineering skills: designing, implementing, and optimizing state-of-the-art AI models; writing bug-free machine learning code; and building deep knowledge of supercomputer performance. Projects aim to push the field forward by improving performance for large runs across massive numbers of GPUs.
This role is based in San Francisco, CA and uses a hybrid work model of 3 days in the office per week. Relocation assistance is offered to new employees.
Responsibilities
- Apply the latest techniques in the internal training framework to achieve high hardware efficiency for training runs.
- Profile and optimize the training framework for performance and scalability (a minimal profiling sketch follows this list).
- Work closely with researchers to enable development of the next generation of models.
- Design, implement, and optimize distributed training components (training loops, state management, checkpointing, orchestration, observability).
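The profiling responsibility above typically starts with an operator-level view of a training step. A minimal sketch using torch.profiler; the toy model, batch size, and step count are assumptions for illustration:

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile

# Toy model and batch size are assumptions, not a real workload.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.Adam(model.parameters())

def train_step() -> None:
    x = torch.randn(64, 256)
    loss = model(x).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Profile a handful of steps; on a GPU box you would add
# ProfilerActivity.CUDA to `activities` to capture kernel time too.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        train_step()

# Print the operators that dominate total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

At frontier scale the same loop would be wrapped in distributed tracing and hardware counters, but the workflow (measure, find the dominant operators, optimize, re-measure) is the same.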
Requirements
- Strong software engineering skills and proficiency in Python.
- Experience with machine learning experiments; having run even small-scale ML experiments yourself is a plus.
- Deep interest in performance optimization and distributed systems; ability to profile and optimize large-scale training runs on many GPUs and supercomputers.
- Familiarity with training frameworks, state management, resilient checkpointing, deterministic orchestration, and observability for distributed training (a determinism sketch follows this list).
- Attention to detail and commitment to producing bug-free code.
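The deterministic orchestration named above usually begins with seeding every RNG and opting into deterministic kernels so a rerun reproduces a run exactly. A minimal single-process sketch; the helper name `seed_everything` is hypothetical, and multi-GPU determinism additionally involves per-rank seeding and environment settings not shown here:

```python
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed every RNG a training job touches."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Opt into deterministic kernels; ops that lack a deterministic
    # implementation raise an error instead of silently diverging.
    torch.use_deterministic_algorithms(True)

seed_everything(1234)
print(torch.randn(3))  # identical output on every run with the same seed
```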
Benefits
- Base pay range listed: $245,000 – $385,000 (in addition to equity and potential performance-related bonuses).
- Medical, dental, and vision insurance with employer contributions to Health Savings Accounts.
- Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses.
- 401(k) retirement plan with employer match.
- Paid parental leave and paid medical/caregiver leave.
- Paid time off (flexible PTO for exempt employees) and company holidays/office closures.
- Mental health and wellness support; employer-paid basic life and disability coverage.
- Annual learning and development stipend.
- Daily meals in offices and meal delivery credits as eligible.
- Relocation support for eligible employees.
Company
OpenAI is an AI research and deployment company focused on ensuring that artificial general intelligence benefits all of humanity. The company is an equal opportunity employer and provides reasonable accommodations for applicants with disabilities. Background checks and fair chance policies are applied where required by law.