Software Engineer, GPU Infrastructure

at OpenAI
USD 255,000-490,000 per year
MIDDLE
✅ Hybrid
✅ Relocation

SCRAPED

Used Tools & Technologies

Not specified

Required Skills & Competences ?

Kubernetes @ 3 CI/CD @ 3 Azure @ 3 GPU @ 3

Details

This role will support the fleet infrastructure team at OpenAI. The fleet team focuses on running the world’s largest, most reliable, and frictionless GPU fleet to support OpenAI’s general purpose model training and deployment. Work on this team includes maximizing GPU utilization through scheduling and quota systems, provisioning and upgrading Kubernetes clusters with automation, supporting research workflows with service frameworks and deployment systems, and ensuring fast model startup via high-performance snapshot delivery from blob storage down to hardware caching.

About the Role

As an engineer within Fleet Infrastructure, you will design, write, deploy, and operate infrastructure systems for model deployment and training on one of the world’s largest GPU fleets. The scale is immense, timelines are tight, and the organization is moving fast; this is an opportunity to shape a critical system in support of OpenAI's mission to advance AI capabilities responsibly.

This role is based in San Francisco, CA and uses a hybrid work model (3 days in the office per week). Relocation assistance is offered to new employees.

Responsibilities

  • Design, implement, and operate components of the compute fleet, including job scheduling, cluster management, snapshot delivery, and CI/CD systems.
  • Interface with researchers and product teams to understand workload requirements and ensure the platform meets their needs.
  • Collaborate with hardware, infrastructure, and business teams to provide high utilization and highly reliable services.
  • Build push-button automation for Kubernetes cluster provisioning and upgrades.
  • Improve model startup times through high-performance snapshot delivery and hardware caching strategies.

Requirements

  • Experience with hyperscale compute systems.
  • Strong programming skills.
  • Experience working in public clouds, especially Microsoft Azure.
  • Experience working with Kubernetes.
  • Execution-focused mentality with a rigorous focus on user requirements.
  • Understanding of AI/ML workloads is a bonus.

Benefits

  • Competitive base pay (see salary range).
  • Equity and potential performance-related bonuses for eligible employees.
  • Medical, dental, and vision insurance with employer contributions to Health Savings Accounts.
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses.
  • 401(k) retirement plan with employer match.
  • Paid parental leave and paid medical/caregiver leave.
  • Flexible PTO for exempt employees and up to 15 days annually for non-exempt employees.
  • 13+ paid company holidays and additional paid office closures.
  • Mental health and wellness support; employer-paid basic life and disability coverage.
  • Annual learning and development stipend.
  • Daily meals in offices and meal delivery credits as eligible.
  • Relocation support for eligible employees.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. The company emphasizes safety and inclusive perspectives in building and deploying AI systems. OpenAI is an equal opportunity employer and provides reasonable accommodations to applicants with disabilities. Background checks are administered in accordance with applicable law.