Software Engineer, Frontier Systems

at OpenAI
USD 295,000-440,000 per year
MIDDLE
βœ… On-site
βœ… Relocation

SCRAPED

Used Tools & Technologies

Not specified

Required Skills & Competences ?

Linux @ 3 Python @ 5 SQL @ 3 Hiring @ 3 Networking @ 3 Pandas @ 3

Details

About the Team

The Frontier Systems team at OpenAI builds, launches, and supports the largest supercomputers in the world that OpenAI uses for its most cutting edge model training. The team takes data center designs, turns them into real, working systems and builds any software needed for running large-scale frontier model trainings. The mission is to bring up, stabilize, and keep these hyperscale supercomputers reliable and efficient during the training of the frontier models.

About the Role

On the Frontier Systems team, you'll build critical infrastructure that keeps our supercomputers running reliably for cutting-edge AI research. Minimizing disruptions is core to the mission because even a single hardware failure can derail a large-scale training run. Engineers own their work end-to-end and are trusted to make a real impact. This role is for someone who goes deep β€” who thrives on root-causing system-level issues and building automation to catch and fix problems at scale.

Responsibilities

  • Own and improve the system health checks that keep hyperscale supercomputers stable during model training.
  • Lead deep dives into hardware failures and system-level bugs to understand how things break at scale.
  • Build automation that monitors and fixes issues across thousands of machines so researchers can continue without interruption.
  • Develop reproducible analyses to support diagnostics and operational decisions.

Requirements

  • 7+ years of industry experience in software engineering.
  • Proficiency with Python and shell scripting.
  • High degree of comfort digging into noisy data with SQL, PromQL, and Pandas (or other tools as necessary).
  • Experience developing reproducible analyses.
  • A balance of strengths in building and operationalizing systems (monitoring, automation, and reliability work).

Prior hardware expertise is explicitly not required for this role.

Nice to Have / Bonus

  • Experience with low-level details of hardware components, protocols, and associated Linux tooling (e.g., PCIe, Infiniband, networking, power management, kernel perf tuning).
  • Experience with visualization of large data centers and networks.
  • Expertise with network operations and tooling.
  • Expertise with power management and stabilization.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. The company emphasizes safety, diverse perspectives, and inclusive hiring. Background checks and accommodations processes are described in the posting; applicants are provided links to OpenAI policies and applicant privacy information.

Compensation & Benefits

  • Base pay range listed: $295,000 - $440,000 (total compensation may include equity and performance-related bonuses).
  • Medical, dental, and vision insurance with employer contributions to Health Savings Accounts.
  • Pre-tax accounts (Health FSA, Dependent Care FSA, commuter expenses).
  • 401(k) retirement plan with employer match.
  • Paid parental, medical, and caregiver leave; flexible PTO and paid company holidays.
  • Mental health and wellness support; employer-paid basic life and disability coverage.
  • Annual learning and development stipend.
  • Daily meals in offices and meal delivery credits as eligible.
  • Relocation support for eligible employees.

More details about benefits and compensation philosophy are available to candidates during the hiring process.