Research Engineer, Pretraining Scaling

USD 315,000-560,000 per year
MIDDLE
✅ On-site

Required Skills & Competences

Distributed Systems @ 3, Communication @ 6, Networking @ 3, Performance Optimization @ 3, Debugging @ 3, LLM @ 3, PyTorch @ 3

Details

Anthropic’s ML Performance and Scaling team trains production pretrained models and ensures frontier models train reliably, efficiently, and at scale. This role sits at the boundary between research and engineering and involves work across the full production training stack: performance optimization, hardware debugging, experimental design, and launch coordination. The role is operationally intense and requires responsiveness during model launches and incidents.

Responsibilities

  • Own critical aspects of the production pretraining pipeline, including model operations, performance optimization, observability, and reliability
  • Debug and resolve complex issues across the full stack — from hardware errors and networking to training dynamics and evaluation infrastructure
  • Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance
  • Respond to on-call incidents during model launches; diagnose problems quickly and coordinate solutions across teams
  • Build and maintain production logging, monitoring dashboards, and evaluation infrastructure
  • Add new capabilities to the training codebase (for example, long-context support or novel architectures)
  • Collaborate across locations (San Francisco and London) and with the Tokens, Architectures, and Systems teams
  • Document systems, debugging approaches, and lessons learned to contribute to institutional knowledge

Requirements

  • Hands-on experience training large language models OR deep expertise with one or more of: JAX, TPU, PyTorch, or large-scale distributed systems
  • Comfort and experience with performance optimization, observability, and improving training reliability at scale
  • Strong debugging skills across multiple layers of the stack (hardware, networking, training dynamics, evaluation)
  • Comfortable being on-call for production systems and working extended hours during launches
  • Ability to design and run experiments to improve training efficiency and model performance
  • Strong communication and collaboration skills, including coordinating across time zones and under high stress
  • Bachelor's degree in a related field or equivalent experience (required)

Strong candidates may also have:

  • Previous experience training LLMs or extensive work with JAX/TPU/PyTorch at scale
  • Contributions to open-source LLM frameworks (for example, open_lm, llm-foundry, or mesh-transformer-jax)
  • Published research on model training, scaling laws, or ML systems
  • Experience with production ML systems, observability tools, or evaluation infrastructure
  • Background as a systems engineer, quant, or in another role that combines technical depth with operational excellence

What Makes This Role Unique

  • Highly operational and hands-on with production model training; responsibilities include responding to incidents and long launch days
  • Opportunity to work on some of the largest and most sophisticated training runs in the industry and to learn from world-class researchers and engineers
  • Close-knit team culture emphasizing impact and collaboration

Logistics / Location / Hours

  • Location: role requires working in-office 5 days per week in San Francisco, CA
  • Visa sponsorship: Anthropic states that it sponsors visas in many cases and retains an immigration lawyer to assist
  • Education: at least a Bachelor's degree in a related field or equivalent experience

Compensation & Benefits

  • Annual base salary range: $315,000 - $560,000 USD
  • Total compensation for full-time employees includes equity and benefits, and may include incentive compensation
  • Other benefits: competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space

How to Apply

  • Applications are reviewed on a rolling basis. The application form requests resume or LinkedIn, contact information, and other optional materials. Anthropic encourages applicants to apply even if they do not meet every qualification listed.