Research Engineer, Pretraining Scaling (London)

GBP 250,000-435,000 per year
MIDDLE
✅ On-site

Required Skills & Competences

  • Distributed Systems @ 3
  • Communication @ 3
  • Networking @ 3
  • Performance Optimization @ 3
  • Debugging @ 3
  • LLM @ 3
  • PyTorch @ 3

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems that are safe and beneficial for users and society. The ML Performance and Scaling team trains production pretrained models and ensures they train reliably, efficiently, and at scale. This role sits at the intersection of research and engineering and involves working across the full production training stack: performance optimization, hardware debugging, experimental design, and launch coordination. The team operates closely during model launches and responds to production issues that require immediate attention.

Responsibilities

  • Own critical aspects of the production pretraining pipeline, including model operations, performance optimization, observability, and reliability.
  • Debug and resolve complex issues across the full stack — from hardware errors and networking to training dynamics and evaluation infrastructure.
  • Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance.
  • Respond to on-call incidents during model launches, diagnose problems quickly, and coordinate solutions across teams.
  • Build and maintain production logging, monitoring dashboards, and evaluation infrastructure.
  • Add new capabilities to the training codebase, such as long context support or novel architectures.
  • Collaborate closely with teammates across San Francisco and London, and with the Tokens, Architectures, and Systems teams.
  • Document systems, debugging approaches, and lessons learned to contribute to the team's institutional knowledge.

Requirements

  • Hands-on experience training large language models or deep expertise with JAX, TPU, PyTorch, or large-scale distributed systems.
  • Comfortable splitting time between research and engineering work (roughly a 50/50 split preferred).
  • Willingness to be on-call for production systems and to work extended hours during launches, including evenings and weekends when required.
  • Strong debugging skills for complex, ambiguous problems across multiple layers of the stack.
  • Clear communication and effective collaboration, including coordinating across time zones and during high-stress incidents.
  • Passion for refining your craft as a research engineer and attention to the societal impacts of AI and responsible scaling.

Strong candidates may also have:

  • Previous experience training LLMs or working extensively with JAX/TPU, PyTorch, or other ML frameworks at scale.
  • Contributions to open-source LLM frameworks (examples: open_lm, llm-foundry, mesh-transformer-jax).
  • Publications on model training, scaling laws, or ML systems.
  • Experience with production ML systems, observability tools, or evaluation infrastructure.
  • Background as a systems engineer, quant, or in another role that combines technical depth with operational excellence.

Location & Logistics

  • This role requires working in-office 5 days per week in London.
  • Education: at least a Bachelor's degree in a related field or equivalent experience is required.
  • Location-based hybrid policy: Anthropic currently expects all staff to be in one of its offices at least 25% of the time, though some roles may require more time in the office.
  • Visa sponsorship: Anthropic does sponsor visas and retains immigration counsel; sponsorship is evaluated per role and candidate.
  • Application deadline: None (applications reviewed on a rolling basis).

Compensation & Benefits

  • Expected base annual salary: £250,000-£435,000 (GBP).
  • Total compensation package for full-time employees includes equity and benefits, and may include incentive compensation.
  • Company highlights: competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space for collaboration.

What Makes This Role Unique

  • Highly operational research engineering work with direct responsibility for keeping production models training smoothly.
  • Significant learning opportunities working on large-scale training runs alongside researchers and engineers.
  • The role emphasizes impact, responsiveness during launches, and building institutional knowledge that compounds over time.

How to Apply

  • Candidates are encouraged to apply even if they do not meet every qualification listed. Anthropic values diverse perspectives and recognizes that not all strong candidates will match every listed requirement.