Used Tools & Technologies
HPC
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 3
Software Development @ 3
Algorithms @ 3
Distributed Systems @ 3
Communication @ 6
Networking @ 3
Prioritization @ 6
Data Analysis @ 3
Debugging @ 3
GPU @ 3
AI @ 3
Details
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. The team is small, highly motivated, and focused on engineering excellence. Engineers are expected to be hands-on, to contribute directly to the company’s mission, and to have strong communication and prioritization skills.
Role overview
You will work on Colossus: a custom, high-performance datacenter network that powers Grok and frontier AI models. The Colossus Networking team builds the software and control systems for massive GPU clusters and the high-speed interconnect fabric used for large-scale AI training. Engineers own the full lifecycle of their software — from design and implementation to deployment, monitoring, and iteration based on real-world performance at scale.
You will solve hard problems in distributed systems, high-performance networking, and real-time control of a very large AI training fabric. Your work will directly impact training efficiency, model convergence, and overall system performance.
Responsibilities
- Develop routing and traffic-engineering algorithms for the Colossus high-performance datacenter network.
- Develop highly reliable, real-time software designed to run on the switches that form the backbone of the low-latency, high-bandwidth AI training fabric.
- Participate in and lead architecture, design, and code reviews.
- Develop prototypes and run experiments to validate key design decisions at both small and full-cluster scale.
- Build tools for software development, deployment, data analysis, visualization, and testing across virtualized environments, hardware-in-the-loop setups, and live production clusters.
- Deploy reliable software updates through continuous integration and release systems with rigorous testing and monitoring.
Basic qualifications
- Bachelor’s degree in computer science, engineering, math, or a related technical discipline; OR 2+ years of professional software development experience in lieu of a degree.
- Strong development experience in C or C++.
Preferred skills and experience
- Strong professional experience writing high-performance C/C++ in production environments.
- Experience developing, debugging, and deploying software that runs at scale in real-world systems.
- Deep knowledge of networking protocols (UDP, TCP/IP, RDMA, etc.), distributed systems, and large-scale datacenter fabrics.
- Background in real-time systems, high-performance computing, low-latency networking, or resource-constrained environments.
- Experience with security considerations in large-scale distributed systems.
- Creative problem-solving ability, exceptional analytical skills, strong engineering fundamentals, and excellent written and verbal communication skills.
- Ability to thrive in a fast-paced, dynamic environment with evolving requirements.
Additional requirements
- Must be willing to work extended hours and weekends as needed.
Compensation and benefits
- Base salary: $180,000 - $440,000 USD per year.
- The total rewards package also includes equity, comprehensive medical/vision/dental coverage, access to a 401(k) retirement plan, short- and long-term disability insurance, life insurance, and various other discounts and perks.
xAI is an equal opportunity employer. For details on data processing, view the Recruitment Privacy Notice linked in the posting.