Software Engineer, Compute Infrastructure
New York City, United States
San Francisco, United States
Seattle, United States
Used Tools & Technologies
HPC
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and you know all the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 3
Kubernetes @ 6
Distributed Systems @ 6
Communication @ 3
Networking @ 6
Performance Optimization @ 3
Debugging @ 3
API @ 3
GPU @ 3
Observability @ 3
AI @ 3
Profiling @ 6
NCCL @ 3
Details
About the Team
Compute Infrastructure builds the platform that turns enormous amounts of compute into a reliable engine for frontier AI. The team designs, provisions, schedules, operates, and optimizes systems that connect accelerators, CPUs, networks, storage, data centers, orchestration software, agent infrastructure, developer tools, and observability into one coherent experience for researchers and product teams.
The work spans the entire stack: capacity planning and cluster lifecycle, bare-metal automation, distributed systems, Kubernetes and scheduling, deep system optimization, high-performance networking, storage, fleet health, reliability, workload profiling, benchmarking, and developer experience. Small improvements to communication, scheduling, hardware efficiency, or debugging workflows compound into meaningful research velocity. This posting is a general opening across Compute Infrastructure to match engineers to problems where they can have the most leverage.
About the Role
We are looking for engineers who want to build the compute platform behind OpenAI's research and products. Candidates may specialize in low-level systems, high-performance computing, distributed infrastructure, reliability, CaaS, agent infrastructure, developer platforms, tooling, or the user experience around infrastructure. The role requires careful reasoning about complex systems, writing durable software, and raising the quality and velocity of peers.
Depending on background and interests, you might work close to hardware, close to users, on CaaS and agent infrastructure, or on control planes and data planes. Possible work includes bringing new supercomputing capacity online, optimizing training workloads from profiler traces and benchmarks, improving NCCL and collective communication behavior, reasoning about GPUs, NICs, topology, firmware, thermals, and failure modes, or designing abstractions that make heterogeneous clusters feel like one coherent platform.
Where you might work
- Compute Foundations: Build low-level platform primitives for heterogeneous hardware, providers, and data centers.
- Fleet / Orchestration: Build reliable, efficient clusters and scheduling systems for researchers and product teams.
- Core Network Engineering: Build and operate high-performance networking fabrics, protocols, and observability for large training and serving workloads.
- Hardware Health and Observability: Detect, diagnose, remediate, and prevent hardware and fleet-health issues across providers and accelerator generations.
- Storage: Build scalable, performant, durable storage abstractions to avoid data movement and storage access bottlenecks.
- Agent Infrastructure: Build sandboxed execution infrastructure for agentic workloads with strong isolation, reliability, and scale.
Responsibilities
- Build and deeply optimize reliable system software for large-scale compute systems that run demanding AI workloads.
- Design and operate infrastructure across accelerators, CPUs, NICs, switches, networking protocols, storage, data centers, cluster orchestration, scheduling, and fleet health.
- Profile, benchmark, and optimize training workloads across compute, memory, storage, networking, NCCL and collective communication, and cluster scheduling bottlenecks.
- Create hardware-aware automation for provisioning, firmware and driver upgrades, incident response, and day-to-day operations.
- Build CaaS, agent infrastructure, profiling, observability, benchmarking, and platform tools to help researchers, product engineers, and operators launch, debug, and optimize workloads.
- Turn operational lessons into better systems, stronger abstractions, and clearer ownership boundaries across teams.
- Collaborate across research, engineering, security, networking, hardware, and data center teams.
You might thrive in this role if you
- Have built or operated distributed systems, infrastructure platforms, high-performance computing environments, large-scale networking systems, Kubernetes clusters, developer tools, or production systems with demanding reliability requirements.
- Enjoy working across layers of the stack and are comfortable moving between software, hardware, networking, systems performance, reliability, and user needs.
- Care about making complex infrastructure understandable, observable, and usable for the people depending on it.
- Can diagnose hard problems under real operational pressure while investing in long-term engineering quality.
- Like building leverage for others through APIs, automation, debugging tools, CaaS and agent infrastructure primitives, workflow improvements, or platform abstractions.
- Are motivated by scale, efficiency, reliability, and disciplined measurement through benchmarks, profiles, and production evidence.
- Communicate clearly, take ownership, and work well with teams whose constraints and goals differ from your own.
Qualifications
- Strong software engineering skills and experience building, operating, or improving production infrastructure systems.
- Experience in one or more relevant areas such as distributed systems, operating systems, networking protocols, RDMA, NCCL or collective communication, storage, Kubernetes, scheduling, observability, reliability engineering, high-performance computing, GPU infrastructure, CaaS, agent infrastructure, hardware-aware performance optimization, benchmarking, developer experience, or infrastructure tooling.
- Ability to debug complex system behavior across software, hardware, networking, and workload layers, and turn findings into robust improvements.
- Comfort with ambiguity, strong ownership, and a bias toward practical, durable solutions.
- Interest in working on infrastructure that directly enables frontier AI research and product impact.
About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. The company emphasizes safety, diverse perspectives, and inclusion. OpenAI is an equal opportunity employer and provides information on applicant privacy, background checks, and accommodation processes.
Benefits
The base pay offered may vary depending on location, knowledge, skills, and experience. In addition to the listed salary range, total compensation includes equity, performance-related bonuses for eligible employees, and benefits including:
- Medical, dental, and vision insurance with employer contributions to Health Savings Accounts.
- Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses.
- 401(k) retirement plan with employer match.
- Paid parental leave and paid medical and caregiver leave.
- Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees.
- 13+ paid company holidays and additional paid office closures.
- Mental health and wellness support.
- Employer-paid basic life and disability coverage.
- Annual learning and development stipend.
- Daily meals in offices and meal delivery credits as eligible.
- Relocation support for eligible employees.
- Additional taxable fringe benefits such as charitable donation matching and wellness stipends.