Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9: expert. Able to teach others and familiar with the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Go @ 5
Kafka @ 3
Scala @ 5
Spark @ 3
Distributed Systems @ 3
Flink @ 3
Performance Optimization @ 6
Rust @ 5
Debugging @ 6
API @ 3
Hadoop @ 3
Experimentation @ 3
Trino @ 3
Observability @ 3
AI @ 3
Profiling @ 6
Details
About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. The team is small, highly motivated, and focused on engineering excellence. The organization operates with a flat structure where all employees are expected to be hands-on, show initiative, and communicate effectively.
Role overview
The Data Platform team builds and operates infrastructure for large-scale data transport and processing across the company. Core systems include Apache Kafka, HDFS, Spark, Flink, and Trino. The team supports real-time ML pipelines, feed ranking, experimentation, analytics, and observability at petabyte scale. It handles latency-critical, high-throughput streaming and distributed compute workloads that demand fault tolerance, performance, and reliability. As a software engineer on the team, you will design, build, and operate distributed systems that process trillions of events daily and power product and ML workloads across the company.
Responsibilities
- Design and implement high-throughput, low-latency data ingestion and transport systems.
- Scale and optimize multi-tenant Kafka infrastructure supporting real-time workloads.
- Extend and tune Spark, Flink, and Trino for demanding production pipelines.
- Build interfaces, APIs, and pipelines enabling teams to query, process, and move data at petabyte scale.
- Debug and optimize distributed systems, focusing on reliability and performance under load.
- Collaborate with ML, product, and infrastructure teams to unblock critical data workflows.
Requirements
- Proven expertise in distributed systems, stream processing, or large-scale data platforms.
- Proficiency in Rust, Go, Scala, or similar systems languages.
- Hands-on experience with Kafka, Flink, Spark, Trino, or Hadoop in production.
- Strong debugging, profiling, and performance optimization skills.
- Track record of shipping and maintaining critical infrastructure.
- Comfortable working in fast-moving, high-stakes environments with minimal guardrails.
Compensation and Benefits
- Base salary: $180,000 - $440,000 USD
- The total rewards package also includes equity, comprehensive medical/vision/dental coverage, access to a 401(k) retirement plan, short- and long-term disability insurance, life insurance, and other discounts and perks.
Additional details
- The team operates at petabyte scale and processes trillions of events daily.
- Emphasis on reliability, performance, fault tolerance, and low-latency processing.