Required Skills & Competences
- Python @ 7
- Distributed Systems @ 3
- Communication @ 4
- SRE @ 7
- API @ 4
- Reporting @ 4
- Compliance @ 4
Details
About the Team
The Technical Support team ensures developers and enterprises can reliably build mission-critical solutions using OpenAI models. The team provides technical guidance, resolves complex issues, and supports customers in maximizing value and adoption from deploying highly-capable models. The team works closely with Technical Success, Product, and Engineering to deliver the best possible experience at scale, using an automation-first mindset and leveraging AI to scale support operations.
About the Role
You will collaborate directly with strategic enterprise accounts and product teams to solve difficult problems for customers. You will be part of the technical troubleshooting team and act as the last line of defense before core Engineering. Responsibilities include designing and running operational processes to monitor top strategic customers, operating a 24x7 response team, and working closely with Infrastructure and Engineering to deliver reliable customer experiences at scale. This is a low-volume, high-difficulty role.
This role is based in San Francisco, CA and uses a hybrid work model (3 days in office per week). Relocation assistance is offered to new employees.
Responsibilities
- Serve as a leading technical and troubleshooting expert for the OpenAI API platform; act as the last point of escalation before core Engineering.
- Proactively identify and act on opportunities to scale support operations through automation and AI.
- Configure and use advanced monitoring and alerting workflows to detect customer-impacting issues in real time.
- Partner with engineering on reliability reviews and preparedness for new features, launches, and strategic customer requirements; ensure operational readiness (monitoring, alerting, fallback plans).
- Design and refine incident response processes and documentation across strategic customers, engineering, and support teams.
- Analyze operational metrics and incident RCAs to identify improvements; recommend and implement enhancements to dashboards, alert configurations, and support workflows.
- Provide support coverage during holidays and weekends based on business needs.
Requirements
- Bachelor’s degree in Computer Science or a related field (or equivalent experience); strong software engineering foundation required.
- 8+ years of experience in technical operations roles such as SRE or NOC, including designing monitoring systems and resolving production issues in mission-critical environments.
- Deep familiarity with modern monitoring, alerting, and observability practices; hands-on experience setting up or managing metrics, logging, and tracing for distributed systems (including SLIs/SLOs and alert tuning).
- Proven experience leading incident response for high-severity outages: real-time incident coordination, root cause analysis, post-mortems, and driving follow-ups.
- Strong skills in scripting or software engineering (e.g., Python or similar) to automate repetitive tasks and integrate tools.
- Solid understanding of cloud infrastructure and distributed systems fundamentals; comfortable with cloud services, load balancers, databases, and containerized applications.
- Effective cross-functional communication skills; able to explain technical issues and coordinate efforts across engineering and non-technical stakeholders.
Technologies and Skills Mentioned
- OpenAI API platform
- Monitoring, alerting, observability (metrics, logging, tracing)
- SLIs / SLOs, dashboard creation, alert tuning
- Incident response, RCA / post-mortems
- Scripting / automation (Python or similar)
- Cloud infrastructure, load balancers, databases, containerized applications
- Distributed systems fundamentals
Benefits
- Competitive base pay (see compensation) and equity offers
- Medical, dental, and vision insurance with employer HSA contributions
- Pre-tax accounts (Health FSA, Dependent Care FSA, commuter)
- 401(k) with employer match
- Paid parental, medical, and caregiver leave
- Flexible PTO and paid company holidays
- Mental health and wellness support
- Employer-paid basic life and disability coverage
- Annual learning and development stipend
- Daily meals in offices and meal delivery credits as eligible
- Relocation support for eligible employees
- Additional taxable fringe benefits (charitable matching, wellness stipends)
Other Notes
- Background checks are administered in accordance with applicable law; reasonable accommodations are available for applicants with disabilities.
- For compliance or reporting concerns, the posting includes links to OpenAI policies and reporting forms.