Research Engineer, Pretraining Scaling (London)

GBP 250,000-435,000 per year
Mid-level
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Distributed Systems, Communication, Networking, Performance Optimization, Debugging, LLM, PyTorch (each rated level 3)

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The ML Performance and Scaling team trains production pretrained models. As a Research Engineer on this team you will ensure frontier models train reliably, efficiently, and at scale, working across the full production training stack including performance optimization, hardware debugging, experimental design, and launch coordination.

Location: London, United Kingdom

Annual base salary range: £250,000 - £435,000

Responsibilities

  • Own critical aspects of the production pretraining pipeline: model operations, performance optimization, observability, and reliability
  • Debug and resolve complex issues across the full stack — from hardware errors and networking to training dynamics and evaluation infrastructure
  • Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance
  • Respond to on-call incidents during model launches; diagnose problems quickly and coordinate solutions across teams
  • Build and maintain production logging, monitoring dashboards, and evaluation infrastructure
  • Add new capabilities to the training codebase (e.g., long-context support or novel architectures)
  • Collaborate closely with teammates across San Francisco and London and with Tokens, Architectures, and Systems teams
  • Document systems, debugging approaches, and lessons learned to contribute to team institutional knowledge

Requirements

  • Hands-on experience training large language models or deep expertise with JAX, TPU, PyTorch, or large-scale distributed systems
  • Comfortable with both research and engineering work (roughly a 50/50 split is desirable)
  • Willingness to be on-call for production systems and to work long days during launches (including evenings/weekends as needed)
  • Strong debugging skills across multiple layers of the stack and ability to coordinate under pressure
  • Clear communication and effective collaboration across time zones
  • At least a Bachelor’s degree in a related field or equivalent experience (required)

Strong Candidates May Also Have

  • Previous experience training LLMs or working extensively with JAX/TPU, PyTorch, or other ML frameworks at scale
  • Contributions to open-source LLM frameworks (examples: open_lm, llm-foundry, mesh-transformer-jax)
  • Published research on model training, scaling laws, or ML systems
  • Experience with production ML systems, observability tools, or evaluation infrastructure
  • Background as a systems engineer or quant, or in other roles requiring both technical depth and operational excellence

What Makes This Role Unique

  • Highly operational research engineering work focused on keeping production models training smoothly
  • Extended hours and responsiveness are expected during launches, balanced by deep learning opportunities working with large-scale training runs and world-class researchers/engineers

Logistics & Other Notes

  • This role requires working in-office 5 days per week in London
  • Visa sponsorship is possible (subject to role and candidate fit)
  • Applications reviewed on a rolling basis
  • Compensation includes base salary, equity, benefits, and may include incentive compensation

Benefits

  • Competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and an office workspace for collaboration