Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by the "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9: expert. Able to teach others and familiar with the technology's pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Go @ 6
Linux @ 6
Python @ 6
Bash @ 6
Networking @ 3
Debugging @ 3
GPU @ 3
Observability @ 3
AI @ 3
HPC @ 2
Details
The Stargate team builds the physical and logical infrastructure that powers large-scale AI systems, designing and delivering next-generation compute environments for frontier model training and inference. In this Systems Engineer role, you will help architect, validate, and operationalize core infrastructure systems across networking, storage, system bring-up, hardware debugging, and cluster readiness. The role is based in San Francisco with a hybrid work model (3 days in office per week) and offers relocation assistance.
Responsibilities
- Own system engineering workstreams across networking, storage, system validation, or bring-up.
- Design and improve top-of-network architectures spanning frontend, WAN, OOB, firewall, and adjacent infrastructure layers.
- Drive logical network readiness including routing, configuration management, provisioning, and issue resolution.
- Define storage architectures across in-rack, in-pod, cluster, and cloud tiers focusing on performance, lifecycle, and cost efficiency.
- Evaluate vendor hardware and infrastructure proposals and provide technical feedback on architecture, reliability, and operational fit.
- Lead system bring-up for new hardware platforms including imaging, provisioning, validation, and readiness for production deployment.
- Debug complex system faults across firmware, NIC, GPU, server, and platform layers; drive root cause analysis with internal teams and external vendors.
- Build tools and automation to improve lab operations, SKU onboarding, fleet readiness, and deployment velocity.
- Partner with hardware, clusters, and operations teams to translate new compute platforms into stable production environments.
- Establish repeatable engineering standards, operational processes, and readiness gates for future expansions.
Requirements
- 7+ years of experience in systems engineering, infrastructure engineering, hardware platforms, or large-scale compute environments.
- Strong technical depth in one or more areas: networking, storage systems, server platforms, firmware, Linux systems, or distributed infrastructure.
- Experience bringing up new hardware systems or clusters in lab or production environments.
- Experience debugging low-level hardware/software issues and driving cross-functional RCA efforts.
- Familiarity with hyperscale infrastructure, AI clusters, HPC environments, or data center systems.
- Experience working with OEM, ODM, JDM, or hardware vendors.
- Strong scripting or software skills in Python, Go, Bash, or similar.
- Ability to operate effectively in fast-moving environments with high ownership and evolving technical requirements.
Preferred Skills
- Experience supporting GPU clusters or accelerator-based infrastructure at scale.
- Familiarity with cluster management, provisioning, or fleet lifecycle tooling.
- Experience with network automation, storage optimization, or systems observability.
- Background working across both hardware and software engineering organizations.
- Experience scaling greenfield infrastructure deployments or rapid expansion programs.
Benefits
- Base salary range: $335,000–$455,000. Total compensation also includes equity and may include performance-related bonuses.
- Medical, dental, and vision insurance with HSA contributions; Health FSA and Dependent Care FSA.
- 401(k) with employer match.
- Paid parental leave and paid medical/caregiver leave.
- Flexible PTO for exempt employees; paid holidays and additional paid office closures.
- Mental health and wellness support; employer-paid basic life and disability coverage.
- Annual learning and development stipend.
- Daily meals in offices and, for eligible employees, meal delivery credits.
- Relocation support for eligible employees, plus other taxable fringe benefits (e.g., donation matching, wellness stipends).
About the Team
The team spans hardware systems, networking, storage, cluster operations, and vendor ecosystems to turn aggressive compute growth plans into scalable, reliable production environments.
About OpenAI
OpenAI is an AI research and deployment company focused on ensuring general-purpose artificial intelligence benefits all of humanity. The company is an equal opportunity employer and provides information on applicant privacy and accommodations.