Research Engineer, Pretraining Scaling (London)

GBP 250,000-435,000 per year
Mid-level
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Distributed Systems, Communication, Networking, Performance Optimization, Debugging, LLM, PyTorch (each rated level 3)

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The ML Performance and Scaling team trains production pretrained models. As a Research Engineer on this team you will ensure frontier models train reliably, efficiently, and at scale, working across the full production training stack including performance optimization, hardware debugging, experimental design, and launch coordination.

Location: London, United Kingdom

Annual base salary range: £250,000 - £435,000

Responsibilities

  • Own critical aspects of the production pretraining pipeline: model operations, performance optimization, observability, and reliability
  • Debug and resolve complex issues across the full stack — from hardware errors and networking to training dynamics and evaluation infrastructure
  • Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance
  • Respond to on-call incidents during model launches; diagnose problems quickly and coordinate solutions across teams
  • Build and maintain production logging, monitoring dashboards, and evaluation infrastructure
  • Add new capabilities to the training codebase (e.g., long-context support or novel architectures)
  • Collaborate closely with teammates across San Francisco and London and with Tokens, Architectures, and Systems teams
  • Document systems, debugging approaches, and lessons learned to contribute to team institutional knowledge

Requirements

  • Hands-on experience training large language models or deep expertise with JAX, TPU, PyTorch, or large-scale distributed systems
  • Comfortable with both research and engineering work (roughly a 50/50 split is desirable)
  • Willingness to be on-call for production systems and to work long days during launches (including evenings/weekends as needed)
  • Strong debugging skills across multiple layers of the stack and ability to coordinate under pressure
  • Clear communication and effective collaboration across time zones
  • At least a Bachelor’s degree in a related field or equivalent experience (required)

Strong Candidates May Also Have

  • Previous experience training LLMs or working extensively with JAX/TPU, PyTorch, or other ML frameworks at scale
  • Contributions to open-source LLM frameworks (examples: open_lm, llm-foundry, mesh-transformer-jax)
  • Published research on model training, scaling laws, or ML systems
  • Experience with production ML systems, observability tools, or evaluation infrastructure
  • Background as a systems engineer or quant, or in other roles requiring both technical depth and operational excellence

What Makes This Role Unique

  • Highly operational research engineering work focused on keeping production models training smoothly
  • Extended hours and responsiveness are expected during launches, balanced by deep learning opportunities working with large-scale training runs and world-class researchers/engineers

Logistics & Other Notes

  • This role requires working in-office 5 days per week in London
  • Visa sponsorship is possible (subject to role and candidate fit)
  • Applications reviewed on a rolling basis
  • Compensation includes base salary, equity, benefits, and may include incentive compensation

Benefits

  • Competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and an office workspace for collaboration