Software Engineer, System Enablement

at OpenAI

📍 San Francisco, United States
📍 Seattle, United States

USD 293,000-455,000 per year

MIDDLE

✅ On-site

✅ Relocation

Used Tools & Technologies

Not specified

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels are defined as follows:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Software Development @ 5, Ansible @ 3, Chef @ 3, Go @ 3, Kubernetes @ 3, Linux @ 5, IaC @ 3, Terraform @ 3, Python @ 3, GCP @ 3, AWS @ 3, Azure @ 3, Bash @ 3, Networking @ 3, Performance Optimization @ 3, Debugging @ 3, Reporting @ 3, AI @ 3

Details

About the Team

The Scaling team is responsible for the architectural and engineering backbone of OpenAI’s infrastructure. The team designs and delivers advanced systems that support the deployment and operation of cutting-edge AI models, spanning system software, networking, platform architecture, fleet-level monitoring, and performance optimization.

About the Role

You will take early, sometimes messy, pre-production hardware and make it production-ready: bootstrapped, stable, imaged, joined to the appropriate Kubernetes control plane, registered correctly, schedulable, and observable. The role sits at the intersection of early hardware bring-up, provisioning automation, fleet/cluster management systems, and lab or cloud provider integration, turning new SKUs into capacity usable by internal customers.

Responsibilities

Own the end-to-end bring-up and bootstrap path for new systems and compute nodes, from bare metal or early access in lab, production, or cloud environments to schedulable fleet capacity: image build, user-data/config, cluster join, and readiness gates.
Build and maintain golden image and provisioning workflows across lab and production environments, working with partner-provided base images and reconciling OS/version requirements.
Integrate nodes into fleet infrastructure and IaC pipelines (Terraform, Chef, etc.), ensuring cloud resources map cleanly onto internal lifecycle expectations (e.g., VMSS/instance pools, image references).
Partner with scheduling and platform owners to ensure new hardware is reachable and scheduled (pool definitions, network/WAN connectivity/routing, admission controls, platform-specific quirks), including cases where new SKUs require scheduling integration changes.
Drive registration and inventory correctness (systems that track nodes and their metadata), including hands-on support to get nodes registered and visible end-to-end.
Collaborate with partner teams to implement baseline health and telemetry bring-up: minimum viable health signals, pass/fail checks, and automated reporting suitable for early ramp decisions.
Debug issues across layers: PXE/bootloader, UEFI/BIOS, BMC, OS bring-up, NIC/network reachability, kubelet/control-plane connectivity, storage constraints, and early rack/lab realities.

Requirements

BS in Computer Science, Electrical Engineering, or equivalent practical experience.
5+ years of experience in systems software development and building/operating Linux-based infrastructure in production or pre-production environments.
Strong, hands-on experience with:
- Kubernetes cluster operations (node lifecycle, bootstrap/join, debugging control-plane connectivity).
- Infrastructure-as-Code / config management (Terraform, Chef/Ansible, etc.).
- Provisioning and imaging (PXE/iPXE, golden images, cloud-init/user-data).
- Networking fundamentals (L2/L3, routing, DNS, firewalling; comfortable debugging reachability).
Proven ability to write automation in Python, Go, and Bash and ship operational tooling and runbooks.

Preferred Skills

Experience bringing up new hardware platforms (early silicon/servers/NICs) in a lab setting and turning them into stable fleet capacity.
Multi-cloud operational experience (Azure, GCP, AWS, OCI), especially with compute pools (e.g., VMSS / instance pools).
Experience building telemetry and health pipelines (agent-based metrics/logging, health rollups, readiness criteria).
Familiarity with WAN, peering, and multi-site network concepts for cluster deployments.

Benefits

Medical, dental, and vision insurance with employer contributions to Health Savings Accounts.
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit).
401(k) retirement plan with employer match.
Paid parental leave and paid medical and caregiver leave.
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees.
13+ paid company holidays and periodic company office closures, plus paid sick or safe time as required.
Mental health and wellness support; employer-paid basic life and disability coverage.
Annual learning and development stipend; daily meals in offices and meal delivery credits as eligible.
Relocation support for eligible employees.
Additional taxable fringe benefits (charitable donation matching, wellness stipends) and potential equity and performance-related bonuses as part of total compensation.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. The company emphasizes safety and inclusion and is an equal opportunity employer. Background checks are administered in accordance with applicable law, and reasonable accommodations are provided to applicants with disabilities.