Software Engineer, Agent Infrastructure

at OpenAI
USD 255,000-405,000 per year
MIDDLE
βœ… Hybrid
βœ… Relocation

SCRAPED

Used Tools & Technologies

Not specified

Required Skills & Competences ?

Kubernetes @ 3 Terraform @ 3 Distributed Systems @ 6 Machine Learning @ 6 Hiring @ 3 FastAPI @ 3 gRPC @ 3 API @ 3 ChatGPT @ 3 Codex @ 3

Details

About the Team

The Agent Infrastructure team at OpenAI is responsible for building systems that enable training and deployment of highly useful AI agents, both internally and for the world. We work hand-in-hand with researchers to design and scale the environment in which agentic models are trained β€” providing a workspace for AI models to execute code, debug issues, and develop software just as human SWEs do. Our training environment for agentic models operates at an extremely high scale and has the flexibility to emulate any environment in which an agent might work.

At the same time, our team builds and maintains OpenAI’s core platform for the deployment and execution of agents in production. Our systems power products such as Codex, Operator, tool use in ChatGPT, and future agentic products. Some of the most challenging technical problems in scaling the capabilities and utility of agents and agentic models lie in the infrastructure layer β€” and our team is focused on building the research and production systems that enable OpenAI to train the most capable models in the world, and maximize the utility of our agentic products for users around the world.

This role is based in San Francisco, CA or New York City, NY. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

Responsibilities

  • Build and scale systems to train highly capable agentic models and to deploy agents to production at large scale.
  • Push massive compute clusters to their limits and contribute to a novel in-house container orchestration platform designed to scale beyond systems like Kubernetes.
  • Develop and maintain FastAPI and gRPC APIs used as interfaces for agentic infrastructure in both training and production environments.
  • Use Terraform to stand up and evolve complex infrastructure for research and production workloads.
  • Collaborate with research teams to stand up and optimize systems for novel AI training runs and experimental applications.

Requirements

  • Deep experience building large-scale machine learning / AI infrastructure and reasoning about training at scale (identifying bottlenecks and engineering solutions to optimize system performance).
  • Experience building new systems from 0β†’1 and scaling them to very large scale.
  • Strong focus on performance and optimization for complex, globally-distributed systems.
  • Experience with cloud platforms and infrastructure-as-code tooling such as Terraform.
  • Deep technical expertise in virtualization and containerization technologies (examples called out: Kata, Firecracker, gVisor, Sysbox) and passion for optimizing runtime performance.
  • Comfortable collaborating closely with researchers and product teams to enable novel use cases and experimental training runs.

Technologies & Tools Mentioned

  • FastAPI
  • gRPC
  • Terraform
  • Container orchestration (in-house platform; scale beyond Kubernetes)
  • Virtualization/container runtimes: Kata, Firecracker, gVisor, Sysbox
  • Cloud platforms and infrastructure-as-code

Benefits & Other Information

  • Base salary range: $255K – $405K (offers equity; total comp includes equity and potential bonuses)
  • Medical, dental, and vision insurance with employer contributions to HSAs
  • Pre-tax accounts (Health FSA, Dependent Care FSA, commuter accounts)
  • 401(k) with employer match
  • Paid parental leave and other paid leave policies
  • Flexible PTO (exempt) and paid days off for non-exempt employees
  • 13+ paid company holidays and coordinated office closures
  • Mental health and wellness support, life and disability coverage
  • Annual learning & development stipend
  • Daily meals in offices and meal delivery credits when eligible
  • Relocation support for eligible employees
  • Hybrid work model: 3 days in office per week

Equal Opportunity & Policies

OpenAI is an equal opportunity employer and provides accommodations to applicants with disabilities. Background checks will be administered in accordance with applicable law. Candidates are provided details about policies and privacy during the hiring process.