Performance Modeling Lead

at OpenAI

📍 San Francisco, United States
📍 Seattle, United States

USD 342,000-555,000 per year

SENIOR

✅ Hybrid

✅ Relocation

Used Tools & Technologies

GPU

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and you know the pitfalls and tricks.

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Distributed Systems @ 4
Machine Learning @ 4
Hiring @ 4
Communication @ 7
Mentoring @ 4
Networking @ 4
AI @ 4
InfiniBand @ 3
NVLink @ 3

Details

OpenAI’s Hardware organization develops system and infrastructure solutions for advanced AI workloads. The Scaling team focuses on understanding and optimizing performance across the full system stack to ensure architectural decisions are grounded in rigorous, quantitative analysis of real-world workloads.

This role will build and lead a small, high-impact team to answer forward-looking architectural questions across AI infrastructure systems. You will develop modeling frameworks and methodologies to evaluate system-level tradeoffs and guide design decisions that influence reference architectures, vendor designs, and long-term infrastructure strategy. The role is based in San Francisco, CA and uses a hybrid work model (3 days in office per week). Relocation assistance is offered for eligible employees.

Responsibilities

Build and own a performance modeling framework/toolchain to evaluate AI systems across multiple levels of abstraction.
Analyze and quantify architectural tradeoffs across compute, memory, networking, storage, and system topology.
Develop performance models to guide decisions on scale-up vs. scale-out architectures, interconnect and network design, and memory hierarchy/system balance.
Translate modeling outputs into clear recommendations for internal teams and external hardware vendors.
Influence reference designs and vendor roadmaps through data-driven insights.
Partner closely with machine learning, systems, and hardware teams to understand workload characteristics and requirements.
Lead and grow a small team (2–3 engineers), set technical direction, and maintain high standards for modeling rigor.
Continuously improve modeling fidelity by validating against real system behavior and measurements.

Requirements

Experience owning or building performance modeling frameworks used to drive real system design decisions.
Deep knowledge of AI/ML workloads, including training and/or inference at scale.
Understanding of system-level tradeoffs across compute, memory, and networking in large-scale distributed systems.
Comfort working across abstraction layers, from workload behavior to hardware implementation.
Experience using modeling (analytical or simulation) to inform architectural decisions.
Ability to operate in ambiguous problem spaces and turn open-ended questions into structured analysis.
Strong communication skills and ability to influence internal teams and external partners.

Preferred Skills

Experience working with hardware vendors (ODM/JDM, silicon, networking).
Background in data center infrastructure or hyperscale systems.
Familiarity with accelerators (GPUs/ASICs) and interconnects (e.g., NVLink, InfiniBand, Ethernet).
Experience influencing hardware roadmaps or reference architectures.
Prior experience leading or mentoring engineers.

Benefits

Medical, dental, and vision insurance with employer contributions to Health Savings Accounts.
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses.
401(k) retirement plan with employer match.
Paid parental leave (up to 24 weeks for birthing parents and 20 weeks for non-birthing parents) and paid medical/caregiver leave (up to 8 weeks).
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees.
13+ paid company holidays and multiple coordinated office closures, plus paid sick or safe time as required by law.
Mental health and wellness support; employer-paid basic life and disability coverage.
Annual learning and development stipend.
Daily meals in offices and meal delivery credits as eligible.
Relocation support for eligible employees.
Additional taxable fringe benefits such as charitable donation matching and wellness stipends.

About OpenAI

OpenAI is an AI research and deployment company focused on ensuring that general-purpose artificial intelligence benefits all of humanity. The company emphasizes safety, diverse perspectives, and inclusion in its mission and hiring practices.