Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level value.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 3
Communication @ 3
Networking @ 3
API @ 3
ChatGPT @ 3
GPU @ 3
AI @ 3
Details
The compute infrastructure team runs the GPU fleet and large-scale compute clusters that serve the models behind ChatGPT and the API, and that support training workloads for next-generation models. It provides a unified platform on which other OpenAI teams run production Applied AI and Research training workloads. The team prioritizes safety and responsible deployment.
This role is based in San Francisco, CA and uses a hybrid work model (3 days in office per week). OpenAI offers relocation assistance to new employees.
Responsibilities
- Own end-to-end delivery of new compute SKUs and large-scale GPU clusters across external partners and providers.
- Drive multi-threaded bring-up programs spanning hardware, networking, power, and cooling; own plans, dependencies, and critical paths.
- Interface with chip providers and work across kernels, communications, hardware, and scheduling engineering teams to derisk long-term onboarding to new hardware platforms.
- Build and operationalize program mechanisms (roadmaps, milestones, risk registers, runbooks) to make delivery predictable at scale.
- Partner with engineering to improve cluster turn-up reliability, repeatability, and automation to reduce time-to-serve for new capacity.
- Coordinate cross-functional readiness with security, finance, operations, and product/research stakeholders to ship production-ready compute.
- Manage integration and handoffs across teams and partners to ensure consistent execution, clear communication, and fast issue resolution.
- Identify bottlenecks and systemic gaps and drive durable fixes across tooling, process, and partner interfaces.
- Provide executive visibility on progress, tradeoffs, and risks across a large portfolio of concurrent programs.
Requirements
- Degree in a hard science or demonstrated track record of engineering expertise.
- 5+ years of experience in program management for major projects, including capital projects or hyperscaler infrastructure deployment.
- Proven ability to be the primary owner responsible for driving and delivering complex projects.
- Comfortable managing cross-functional and cross-company teams; experience driving information and decision hygiene.
- Track record of delivering high-profile technical projects against tight deadlines.
- Technical aptitude with experience partnering effectively with engineering or research teams.
- Experience interfacing with and leading external vendors (engineering firms, equipment suppliers, construction firms).
- Expertise designing and implementing simple, scalable processes to solve complex problems.
- Experience managing complicated dependencies such as logistics and supply chains.
- Strong organizational, risk management, and stakeholder-communication skills.
About OpenAI
OpenAI is an AI research and deployment company focused on ensuring general-purpose artificial intelligence benefits all of humanity. The company emphasizes safe deployment and values diverse perspectives. OpenAI is an equal opportunity employer, conducts background checks, and provides reasonable accommodations for applicants with disabilities.
Benefits
- Base pay range listed above; total compensation may include equity and performance-related bonuses.
- Medical, dental, and vision insurance with employer HSA contributions.
- Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses.
- 401(k) retirement plan with employer match.
- Paid parental, medical, and caregiver leave.
- Flexible PTO for exempt employees; paid time off for non-exempt employees.
- 13+ paid company holidays and additional paid company office closures.
- Mental health and wellness support; employer-paid basic life and disability coverage.
- Annual learning and development stipend.
- Daily meals in offices and meal delivery credits as eligible.
- Relocation support for eligible employees.