Software Engineer, Workload Enablement

at OpenAI
USD 293,000-455,000 per year
MIDDLE
✅ Hybrid
✅ Relocation

Used Tools & Technologies

Machine Learning

Required Skills & Competences

Kubernetes @ 3, Python @ 5, Distributed Systems @ 5, Hiring @ 3, Communication @ 3, Networking @ 3, Performance Optimization @ 3, Debugging @ 3, LLM @ 6, PyTorch @ 6, CUDA @ 5, GPU @ 3, AI @ 3, Profiling @ 6, NCCL @ 3, HPC @ 5, NVLink @ 3

Details

About the Team

The Scaling team is responsible for the architectural and engineering backbone of OpenAI's infrastructure. We design and deliver advanced systems that support the deployment and operation of cutting-edge AI models. Our work spans system software, networking, platform architecture, fleet-level monitoring, and performance optimization.

About the Role

We're hiring a Software Engineer to enable production workloads and end-to-end testing on new platforms. The role includes creating new test harnesses and platform stress benchmarks; porting existing inference and training workloads to new, sometimes early-access, systems and hardware; analyzing performance and bottlenecks; and characterizing the end-to-end behavior of new systems (compute, comms, storage, control plane, and failure modes).

Responsibilities

  • Port and validate key inference and training workloads on new platforms/SKUs as they arrive; drive correctness, performance, and stability to an internal readiness bar.
  • Build a suite of benchmarks and stress tests that capture real end-to-end behavior of workloads by exercising CPU, GPU, memory subsystem, frontend, scale-up and scale-out networking (including WAN traffic, NVLink and RDMA collectives), storage, thermals, and other relevant components.
  • Deep-dive performance on distributed training/inference, including collective performance and tuning (across NCCL/RCCL and internal libraries), overlap of compute/communication, kernel-level bottlenecks, memory bandwidth and scheduling effects.
  • Create repeatable test harnesses that run in CI / lab environments and produce actionable outputs (pass/fail, performance score, regression detection).
  • Partner with systems and fleet bring-up engineers to ensure platforms are operationally usable and scalable (containerization, Kubernetes integration, telemetry hooks, failure triage loops).
  • Work cross-functionally with vendors and internal stakeholders by producing clear bug reports, minimal repros, and prioritized issue lists.

Requirements

  • BS in Computer Science, Electrical Engineering, or equivalent practical experience.
  • 5+ years in one or more of: ML systems, performance engineering, distributed systems, or HPC.
  • Strong hands-on experience with PyTorch and modern LLM training/inference stacks.
  • Knowledge of large-scale distributed training concepts (data/model/pipeline parallelism, collective communications).
  • Experience with RDMA and debugging/optimizing communications libraries (NCCL or RCCL) and their interaction with hardware and network.
  • Proficiency in Python and comfort reading/writing performance-critical code (C++/CUDA/HIP is a plus).
  • Strong profiling and debugging skills (examples: Nsight, rocprof, perf, flamegraphs) and ability to reason from traces and counters.

Preferred Skills

  • Experience building workload-shaped benchmarks and stress/fault tests that correlate to production behavior (beyond synthetic loops or microbenchmarks).
  • Familiarity with RDMA networking and transport tuning; understanding of how network topology and congestion impact collectives.
  • Experience running and validating workloads in Kubernetes and bridging research code into robust, repeatable infrastructure.
  • Hands-on lab experience with early hardware (new NICs, new GPUs/accelerators, early racks).

Benefits

  • Medical, dental, and vision insurance with employer contributions to Health Savings Accounts.
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses.
  • 401(k) retirement plan with employer match.
  • Paid parental leave and paid medical and caregiver leave.
  • Paid time off (flexible PTO for exempt employees; up to 15 days annually for non-exempt employees), 13+ paid company holidays, and additional paid closures and sick/safe time per applicable laws.
  • Mental health and wellness support; employer-paid basic life and disability coverage.
  • Annual learning and development stipend.
  • Daily meals in offices and meal delivery credits as eligible.
  • Relocation support for eligible employees.
  • Additional taxable fringe benefits (charitable donation matching, wellness stipends) may be provided.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of AI capabilities and seek to safely deploy them through our products. OpenAI is an equal opportunity employer and provides reasonable accommodations to applicants with disabilities. Background checks will be administered in accordance with applicable law.