Research Engineer, Pretraining Scaling

at Anthropic

📍 San Francisco, United States

USD 315,000-560,000 per year

MIDDLE

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Distributed Systems @ 3
Communication @ 6
Networking @ 3
Performance Optimization @ 3
Debugging @ 3
LLM @ 3
PyTorch @ 3

Details

Anthropic’s ML Performance and Scaling team trains production pretrained models and ensures frontier models train reliably, efficiently, and at scale. This role sits at the boundary between research and engineering and involves work across the full production training stack: performance optimization, hardware debugging, experimental design, and launch coordination. The role is operationally intense and requires responsiveness during model launches and incidents.

Responsibilities

Own critical aspects of the production pretraining pipeline, including model operations, performance optimization, observability, and reliability
Debug and resolve complex issues across the full stack — from hardware errors and networking to training dynamics and evaluation infrastructure
Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance
Respond to on-call incidents during model launches; diagnose problems quickly and coordinate solutions across teams
Build and maintain production logging, monitoring dashboards, and evaluation infrastructure
Add new capabilities to the training codebase, such as long-context support or novel architectures
Collaborate across locations (San Francisco and London) and with the Tokens, Architectures, and Systems teams
Document systems, debugging approaches, and lessons learned to contribute to institutional knowledge

Requirements

Hands-on experience training large language models, or deep expertise with one or more of JAX, TPU, PyTorch, and large-scale distributed systems
Comfort and experience with performance optimization, observability, and improving training reliability at scale
Strong debugging skills across multiple layers of the stack (hardware, networking, training dynamics, evaluation)
Comfortable being on-call for production systems and working extended hours during launches
Ability to design and run experiments to improve training efficiency and model performance
Strong communication and collaboration skills, including coordinating across time zones and under high stress
Bachelor's degree in a related field or equivalent experience (required)

Strong candidates may also have:

Previous experience training LLMs or extensive work with JAX/TPU/PyTorch at scale
Contributions to open-source LLM frameworks such as open_lm, llm-foundry, or mesh-transformer-jax
Published research on model training, scaling laws, or ML systems
Experience with production ML systems, observability tools, or evaluation infrastructure
Background as a systems engineer, quant, or in another role combining technical depth and operational excellence

What Makes This Role Unique

Highly operational and hands-on with production model training; responsibilities include responding to incidents and working long launch days
Opportunity to work on some of the largest and most sophisticated training runs in the industry and to learn from world-class researchers and engineers
Close-knit team culture emphasizing impact and collaboration

Logistics / Location / Hours

Location: the role requires working in the office 5 days per week in San Francisco, CA
Visa sponsorship: Anthropic states that it sponsors visas in many cases and retains an immigration lawyer to assist
Education: at least a Bachelor's degree in a related field or equivalent experience

Compensation & Benefits

Annual base salary range: $315,000 - $560,000 USD
Total compensation for full-time employees includes equity and benefits, and may include incentive compensation
Other benefits: competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space

How to Apply

Applications are reviewed on a rolling basis. The application form requests resume or LinkedIn, contact information, and other optional materials. Anthropic encourages applicants to apply even if they do not meet every qualification listed.