Software Engineer, Workload Enablement

at OpenAI

📍 San Francisco, United States
📍 Seattle, United States

USD 293,000-455,000 per year

MIDDLE

✅ Hybrid

✅ Relocation

Used Tools & Technologies

Machine Learning

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and you know the pitfalls and tricks.

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Kubernetes @ 3, Python @ 5, Distributed Systems @ 5, Hiring @ 3, Communication @ 3, Networking @ 3, Performance Optimization @ 3, Debugging @ 3, LLM @ 6, PyTorch @ 6, CUDA @ 5, GPU @ 3, AI @ 3, Profiling @ 6, NCCL @ 3, HPC @ 5, NVLink @ 3

Details

About the Team

The Scaling team is responsible for the architectural and engineering backbone of OpenAI's infrastructure. We design and deliver advanced systems that support the deployment and operation of cutting-edge AI models. Our work spans system software, networking, platform architecture, fleet-level monitoring, and performance optimization.

About the Role

We’re hiring a Software Engineer to enable production workloads and end-to-end testing on new platforms. The role includes creating new test harnesses and platform stress benchmarks; porting existing inference and training workloads to new, sometimes early-access, systems and hardware; analyzing performance and bottlenecks; and characterizing the end-to-end behavior of new systems (compute, comms, storage, control plane, and failure modes).

Responsibilities

Port and validate key inference and training workloads on new platforms/SKUs as they arrive; drive correctness, performance, and stability to an internal readiness bar.
Build a suite of benchmarks and stress tests that capture real end-to-end behavior of workloads by exercising CPU, GPU, memory subsystem, frontend, scale-up and scale-out networking (including WAN traffic, NVLink and RDMA collectives), storage, thermals, and other relevant components.
Deep-dive into performance of distributed training/inference, including collective performance and tuning (across NCCL/RCCL and internal libraries), overlap of compute and communication, kernel-level bottlenecks, memory bandwidth, and scheduling effects.
Create repeatable test harnesses that run in CI / lab environments and produce actionable outputs (pass/fail, performance score, regression detection).
Partner with systems and fleet bring-up engineers to ensure platforms are operationally usable and scalable (containerization, Kubernetes integration, telemetry hooks, failure triage loops).
Work cross-functionally with vendors and internal stakeholders by producing clear bug reports, minimal repros, and prioritized issue lists.

Requirements

BS in Computer Science, Electrical Engineering, or equivalent practical experience.
5+ years in one or more of: ML systems, performance engineering, distributed systems, or HPC.
Strong hands-on experience with PyTorch and modern LLM training/inference stacks.
Knowledge of large-scale distributed training concepts (data/model/pipeline parallelism, collective communications).
Experience with RDMA and debugging/optimizing communications libraries (NCCL or RCCL) and their interaction with hardware and network.
Proficiency in Python and comfort reading/writing performance-critical code (C++/CUDA/HIP is a plus).
Strong profiling and debugging skills (examples: Nsight, rocprof, perf, flamegraphs) and ability to reason from traces and counters.

Preferred Skills

Experience building workload-shaped benchmarks and stress/fault tests that correlate to production behavior (beyond synthetic loops or microbenchmarks).
Familiarity with RDMA networking and transport tuning; understanding of how network topology and congestion impact collectives.
Experience running and validating workloads in Kubernetes and bridging research code into robust, repeatable infrastructure.
Hands-on lab experience with early hardware (new NICs, new GPUs/accelerators, early racks).

Benefits

Medical, dental, and vision insurance with employer contributions to Health Savings Accounts.
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses.
401(k) retirement plan with employer match.
Paid parental leave and paid medical and caregiver leave.
Paid time off (flexible PTO for exempt employees; up to 15 days annually for non-exempt employees), 13+ paid company holidays, and additional paid closures and sick/safe time per applicable laws.
Mental health and wellness support; employer-paid basic life and disability coverage.
Annual learning and development stipend.
Daily meals in offices and meal delivery credits as eligible.
Relocation support for eligible employees.
Additional taxable fringe benefits (charitable donation matching, wellness stipends) may be provided.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of AI capabilities and seek to safely deploy them through our products. OpenAI is an equal opportunity employer and provides reasonable accommodations to applicants with disabilities. Background checks will be administered in accordance with applicable law.