Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level value.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — expert. You can teach others and know the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
IaC @ 3
Python @ 3
GPU @ 3
AI @ 3
Details
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. The Network Software and Services for AI (nssAI) team builds software, services, and frameworks to empower Network Development Engineers. This role is hands-on and focuses on automation-first solutions for production and ancillary networks, including metric collection, configuration, zero-touch provisioning, monitoring, and auto-remediation.
Responsibilities
- Build cutting-edge software, services, and frameworks to support network development and operations.
- Develop extensible tools and streamline complex processes to ensure reliability for large GPU supercomputing network fabrics used for AI training and inference.
- Implement Infrastructure-as-Code (IaC) best practices and enhance deployment pipelines.
- Provide extensive metrics coverage and monitoring to prioritize work and drive automation.
- Support zero-touch provisioning, configuration management, and auto-remediation workflows.
- Travel to Palo Alto and data center locations for collaboration and hands-on validation of software.
Requirements / Preferred Skills and Experience
- Deep experience collaborating daily with network engineers and extensive knowledge of network topologies (physical and logical) and network protocols.
- Expert knowledge and a proven history of designing scalable, reliable software capable of orchestrating large numbers of network devices.
- Ability to thrive in ambiguity and create metrics that help prioritize team focus.
- Familiarity or experience with IaC, deployment pipelines, monitoring, and secure service delivery in production.
Tech Stack / Technologies Mentioned
- Python
- Go
- TCP/IP
- BGP
- RDMA
- Infrastructure as Code (IaC)
- Metrics collection, monitoring, zero-touch provisioning, and auto-remediation
Location
The role is based in the Palo Alto, California or Memphis, Tennessee office, or remote. Travel is expected to Palo Alto for inter-team collaboration and to the data center for hands-on experience using the software you write.
Benefits
Base salary is one part of the total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.
Annual Salary Range: $60,000 - $88,000