Lead, Incidents & Escalations, User Operations

at OpenAI

📍 San Francisco, United States

USD 234,000-315,000 per year

SENIOR

✅ Hybrid

✅ Relocation

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;

3-6: daily use. Comfortable with regular usage and capable of handling common tasks and challenges related to the technology;

7-9: expert. You can teach others and you know all the pitfalls and tricks;

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding this level.

Datadog @ 4, Leadership @ 6, Communication @ 4, SRE @ 8, Jira @ 4, Reporting @ 4, Salesforce @ 4, Observability @ 4, AI @ 4

Details

About the Team

OpenAI's User Operations team shepherds our customers' adoption of AI and ensures that their product experience is exceptional. The team resolves complex issues, provides technical guidance, and helps customers maximize the value and adoption they get from deploying OpenAI products. User Operations works closely with Sales, Technical Success, Product, Engineering, and other partners. Customers range from early-stage startups to established global enterprises.

About the Role

This is a hands-on player-coach role to build and run OpenAI's Incidents & Escalations function within User Operations. You will set the operating model and also step into active incidents and urgent escalations when needed, coordinating with on-call teams, driving clear ownership, supporting communications, and ensuring issues move through resolution and post-incident closure.

During active incidents, you will coordinate with on-call teams and cross-functional responders across Engineering, Infrastructure, Support Delivery, Product, and Go‑To‑Market. You will keep teams aligned, maintain timelines, clarify ownership, escalate when needed, and ensure internal, executive, customer-facing, and external communications are accurate and timely (including status page updates when required).

For escalations, you will build and run processes for tracking, triaging, mitigating, and resolving critical customer and user issues. After incidents and escalations, you will own follow-through: retrospectives, root cause identification, action item tracking, trend analysis, and process improvements that reduce repeat issues over time. You will help define the long-term operating model for incidents and escalations across Support Delivery, Engineering, Infrastructure, and other partners.

This role is based in San Francisco and uses a hybrid work model (3 days in office per week). Relocation assistance is offered to new employees.

Responsibilities

Participate in an on-call rotation and serve as the active incident lead during live incidents and urgent escalations.
Own alert intake and triage across support, safety, customer, and service-impacting issues.
Assess severity, determine scope and impact, and initiate appropriate response paths.
Page and coordinate Engineering, Infrastructure, Support Delivery, Product, Legal, Policy, Go‑To‑Market, and other teams as needed.
Lead incident response calls, manage timelines, clarify roles, and keep responders focused and unblocked.
Set internal guidelines for incident communications; own internal updates, executive briefings, customer-facing updates, and external status page updates where required.
Maintain situational awareness across customer-facing incidents and parallel workstreams to ensure coordinated responses and clear understanding of customer impact.
Create and operate processes for monitoring, processing, mitigating, and resolving critical escalations, including formal closure and handoff.
Identify root causes behind incidents and escalations, lead retrospectives, and coordinate corrective action with accountable teams.
Track corrective actions to closure, focusing on the best possible customer outcome.
Identify recurring operational issues, escalation patterns, and product or process gaps.
Partner with Engineering, Infrastructure, Product, and Support leaders to reduce repeat incidents and improve readiness.
Improve incident response processes, severity frameworks, playbooks, tooling, reporting, and automation.
Build a durable global operating model for incidents and escalations as OpenAI scales.

Requirements

10+ years of experience in incident management, technical support, escalation management, SRE, technical program management, or production operations.
5+ years of hands-on experience working in production, on-call, or high-urgency operational environments.
5+ years of leadership experience, ideally in a Support, Engineering, or similar environment.
Comfortable acting as Incident Commander: owning coordination, decision-making, communication, and accountability during live incidents.
Direct experience with customer-impacting incidents, executive escalations, safety-sensitive escalations, or high-severity technical issues.
Strong ability to communicate clearly under pressure with engineers, support teams, executives, customer-facing teams, and external stakeholders.
Hands-on experience with incident communications (internal updates, executive briefings, customer updates, status pages).
Experience with incident management, paging, and alerting tools such as incident.io, PagerDuty, Datadog, Jira, Salesforce, Zendesk, or similar systems.
Comfortable reasoning about monitoring and observability to interpret alerts, system health, customer impact, and incident scope.
Ability to lead post-incident retrospectives that produce clear root causes, corrective actions, and durable improvements.
Track record of driving action items to closure and holding teams accountable without creating unnecessary process drag.
Highly organized, calm, and structured in ambiguous or high-pressure situations.
Ability to balance hands-on incident execution with longer-term systems building.
Interest in using AI and automation to improve triage, routing, summarization, reporting, knowledge management, and incident follow-through.

Benefits

Base pay range: $234,000–$315,000 (total compensation also includes equity and performance-related bonuses where applicable).
Medical, dental, and vision insurance with employer contributions to Health Savings Accounts.
Pre-tax accounts: Health FSA, Dependent Care FSA, commuter (parking and transit).
401(k) retirement plan with employer match.
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents); paid medical and caregiver leave (up to 8 weeks).
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees.
13+ paid company holidays and coordinated office closures; paid sick or safe time per applicable law.
Mental health and wellness support.
Employer-paid basic life and disability coverage.
Annual learning and development stipend.
Daily meals in offices and meal delivery credits as eligible.
Relocation support for eligible employees.
Additional taxable fringe benefits (charitable donation matching, wellness stipends) may be provided.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of AI capabilities and seek to safely deploy them through our products. OpenAI is an equal opportunity employer and provides reasonable accommodations to applicants with disabilities. Background checks will be administered in accordance with applicable law.