Software Engineer, Collective Communication
Required Skills & Competences
- Software Development @ 3
- Algorithms @ 2
- Communication @ 3
- Networking @ 3
- CUDA @ 3
- GPU @ 3
About the Team
The Workload Networking team is responsible for the collective communication stack used in our largest training jobs. Using a combination of C++ and CUDA, we work on novel collective communication techniques that enable efficient training of our flagship models on our largest custom-built supercomputers.
The models we train are key ingredients in AI research progress at OpenAI and across the field as a whole, and we continually incorporate learnings from our entire research organization into our training platform.
About the Role
As a Software Engineer, Networking, you will design and implement custom networking collectives that are tightly integrated into our training stack. We're looking for people who have a background in low-level, performance-critical software. Experience with collective communication is a bonus.
This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.
Responsibilities
- Collaborate closely with ML researchers to design and implement efficient collective operations in C++ and CUDA.
- Ensure that our largest training jobs take full advantage of the different network transports used in our supercomputers.
- Work on simulations to inform our future supercomputer network designs.
Requirements
- Background in low-level, performance-critical software development.
- Comfortable writing low-level, performance-sensitive CPU and/or GPU code (C++ and CUDA).
- Experience or familiarity with collective communication and distributed algorithms (experience with RDMA is a plus).
- Familiarity with network simulation techniques and high-performance computing/network transports.
Benefits
- Competitive base pay (see salary range) plus equity and potential bonuses.
- Medical, dental, and vision insurance with employer contributions to Health Savings Accounts.
- Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses.
- 401(k) retirement plan with employer match.
- Paid parental, medical, and caregiver leave; PTO; paid company holidays and closures.
- Mental health and wellness support; employer-paid basic life and disability coverage.
- Annual learning and development stipend; daily meals in offices and meal credits as eligible.
- Relocation support for eligible employees.
Additional Notes
- Background checks will be administered in accordance with applicable law.
- OpenAI is an equal opportunity employer and provides reasonable accommodations to applicants with disabilities.