Site Reliability Engineer, Frontier Systems Infrastructure

at OpenAI

📍 San Francisco, United States

USD 255,000-490,000 per year

MIDDLE

✅ On-site

✅ Relocation

Tech Stack
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;

3-6: daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9: expert. You can teach others and you know all the pitfalls and tricks;

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

CloudFormation @ 2, Distributed Systems @ 3, GPU @ 3, Go @ 6, Hiring @ 3, Kubernetes @ 3, Linux @ 3, Networking @ 3, Python @ 6, Security @ 5, Terraform @ 2

Details

The Frontier Systems team at OpenAI builds, launches, and supports the largest supercomputers in the world used for cutting-edge model training. The team turns data center designs into working systems and builds software for running large-scale frontier model training runs. Its mission is to bring up and stabilize these hyperscale supercomputers and keep them reliable and efficient while frontier models train.

About the Role

This role blends distributed systems engineering with hands-on infrastructure work in large data centers. You will operate the next generation of compute clusters that power OpenAI’s frontier research: growing Kubernetes clusters to massive scale, automating bare-metal bring-up, and building software layers that hide the complexity of many nodes across multiple data centers. You will work where hardware and software intersect, and where speed and reliability are critical. Expect to manage fast-moving operations, quickly diagnose and fix urgent incidents, and continuously raise the bar for automation and uptime.

Responsibilities

Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management.
Build software abstractions that unify multiple clusters and present a seamless interface to training workloads.
Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale.
Improve operational metrics, for example by reducing cluster restart times and accelerating firmware or OS upgrade cycles.
Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure.
Develop monitoring and observability systems to detect issues early and keep clusters stable under extreme load.
Execute at the same level as a software engineer and contribute to automation and reliability improvements.

Requirements / Qualifications

Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments.
Strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads.
Proficiency in cloud infrastructure concepts (compute, networking, storage, security) and automating cluster or data center operations.
Strong programming or scripting skills (Python, Go, or similar).
Familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation.
Comfort with bare-metal Linux environments, GPU hardware, and large-scale networking.
Enthusiasm for solving fast-moving, high-impact operational problems and for building automation that eliminates manual work.
Ability to balance careful engineering with the urgency of keeping mission-critical systems running.

Preferred / Bonus

Background with GPU workloads, firmware management, or high-performance computing.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring general-purpose AI benefits all of humanity. The company emphasizes safety, diverse perspectives, and equal opportunity. Background checks will be administered in accordance with applicable law. OpenAI provides reasonable accommodations to applicants with disabilities.

Benefits & Compensation Notes

Base salary range listed for this role: $255,000 - $490,000 (base pay may vary depending on location, knowledge, skills, and experience).
Total compensation can include equity and performance-related bonuses for eligible employees.
Benefits include medical, dental, and vision insurance; employer contributions to HSAs; pre-tax FSAs; 401(k) with employer match; paid parental leave; paid time off; paid company holidays; mental health and wellness support; employer-paid basic life and disability coverage; annual learning stipend; daily meals and meal credits; and additional taxable fringe benefits.
Relocation support is offered for eligible employees.

For more details, see the OpenAI policies and benefit materials provided during the hiring process.