Member Of Technical Staff - Multimodal Understanding

at xAI

📍 Palo Alto, United States

USD 180,000-440,000 per year

MIDDLE

✅ On-site

Used Tools & Technologies

Machine Learning, LLM

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Kubernetes @ 3, Python @ 5, Spark @ 3, Algorithms @ 3, Distributed Systems @ 3, Rust @ 5, PyTorch @ 6, GPU @ 3, AI @ 3, Data Pipelines @ 3, JAX @ 6

Details

xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. The team is small, highly motivated, and focused on engineering excellence. Employees are expected to be hands-on, show initiative, communicate concisely, and contribute directly to the company’s mission.

About the role

Join the multimodal team to advance multimodal intelligence across image, video, audio, and text. Work across the full stack: data curation/acquisition, tokenizer training, large-scale pre-training, post-training/alignment, infrastructure/scaling, evaluation, tooling/demos, and end-to-end product experiences. Collaborate cross-functionally with pre-training, post-training, reasoning, data, applied, and product teams to deliver capabilities in multimodal reasoning, world modeling, tool use, agentic behaviors, and interactive human-AI collaboration.

Responsibilities

Design, build, and optimize large-scale distributed systems for multimodal pre-training, post-training, inference, data processing, and tokenization at web/petabyte scale.
Develop high-throughput pipelines for data acquisition, preprocessing, filtering, generation, decoding, loading, crawling, visualization, and management for images, videos, audio, and text.
Advance multimodal capabilities including spatial-temporal compression, cross-modal alignment, world modeling, reasoning, emergent abilities, audio/image/video understanding & generation, real-time video processing, and noisy data handling.
Drive data quality work and studies: curation (human/synthetic), filtering techniques, analysis, and scalable pipelines to support trillion-parameter models.
Create evaluation frameworks, internal benchmarks, reward models, and metrics that capture real-world usage, failure modes, interactive dynamics, and human-AI synergy.
Innovate on algorithms, modeling approaches, hardware/software/algorithm co-design, and scaling paradigms for state-of-the-art performance.
Build research tooling, user-friendly interfaces, prototypes/demos, and full-stack applications, and enable rapid iteration based on feedback.
Work across the stack (pre-training → SFT/RL/post-training) to enable reasoning, tool calling, agentic behaviors, orchestration, and seamless real-time interactions.

Basic qualifications

Hands-on experience with multimodal pre-training, post-training, or fine-tuning (vision, audio, video, or cross-modal).
Expert-level proficiency in Python (core language).
Strong experience in at least one of: JAX, PyTorch, or XLA.
Proven track record building or optimizing large-scale distributed ML systems (training/inference optimization, GPU utilization, multi-GPU/TPU setups, hardware co-design).
Deep experience designing and running data pipelines at scale: curation, filtering, generation, quality studies, especially for noisy/real-world multimodal data.
Strong fundamentals in evaluation design, benchmarks, reward modeling, or RL techniques (particularly for interactive/agentic behaviors).
Proactive self-starter who thrives in high-intensity environments and is willing to own end-to-end initiatives.

Preferred skills and experience

Experience leading major improvements in model capabilities through better data, modeling, algorithms, or scaling.
Familiarity with state-of-the-art multimodal LLMs, scaling laws, tokenizers, compression techniques, reasoning, or agentic systems.
Proficiency in Rust and/or C++ for performance-critical components.
Hands-on work with large-scale orchestration tools such as Spark, Ray, or Kubernetes.
Background building full-stack tooling: performant interfaces, real-time research demos/apps, or end-to-end product ownership.
Passion for end-to-end user experience in interactive, real-time multimodal AI systems.

Compensation and benefits

Base salary: $180,000 - $440,000 USD per year.
Total rewards include equity, comprehensive medical/vision/dental, access to a 401(k) plan, short & long-term disability insurance, life insurance, and other perks.

xAI is an equal opportunity employer. For details on data processing, view the Recruitment Privacy Notice linked in the original posting.