Research Engineer, Pretraining Scaling

USD 315,000-560,000 per year
Seniority: Middle
On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

  • Distributed Systems (3)
  • Communication (6)
  • Networking (3)
  • Performance Optimization (3)
  • Debugging (3)
  • LLM (3)
  • PyTorch (3)

Details

Anthropic’s ML Performance and Scaling team trains production pretrained models and ensures frontier models train reliably, efficiently, and at scale. This role sits between research and engineering and involves working across the full production training stack: performance optimization, hardware debugging, experimental design, and launch coordination. It is operationally intensive during model launches, requiring on-call incident response and coordination across teams in multiple time zones.

Responsibilities

  • Own critical aspects of the production pretraining pipeline, including model operations, performance optimization, observability, and reliability.
  • Debug and resolve complex issues across the full stack — from hardware errors and networking to training dynamics and evaluation infrastructure.
  • Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance.
  • Respond to on-call incidents during model launches, diagnose problems quickly, and coordinate solutions across teams.
  • Build and maintain production logging, monitoring dashboards, and evaluation infrastructure.
  • Add new capabilities to the training codebase (e.g., long-context support or novel architectures).
  • Collaborate closely with teammates across San Francisco and London and with Tokens, Architectures, and Systems teams.
  • Document systems, debugging approaches, and lessons learned to contribute to institutional knowledge.

Requirements

  • Hands-on experience training large language models, or deep expertise with JAX, TPU, PyTorch, or large-scale distributed systems.
  • Comfortable with a roughly 50/50 split between research and engineering work (not heavily weighted to one side).
  • Willingness to be on-call for production systems and to work long days during launches (including evenings and weekends when needed).
  • Strong debugging skills for complex, ambiguous problems across multiple layers of the stack.
  • Ability to communicate clearly and collaborate effectively across time zones and under stress.
  • Passion for refining your craft as a research engineer, and care about the societal impacts of AI and responsible scaling.
  • Education: at least a Bachelor's degree in a related field or equivalent experience.

Strong Candidates May Also Have

  • Previous experience training LLMs or working extensively with JAX/TPU, PyTorch, or other ML frameworks at scale.
  • Contributions to open-source LLM frameworks (examples: open_lm, llm-foundry, mesh-transformer-jax).
  • Published research on model training, scaling laws, or ML systems.
  • Experience with production ML systems, observability tools, or evaluation infrastructure.
  • Background as a systems engineer, quantitative researcher, or similar roles requiring technical depth and operational excellence.

What Makes This Role Unique

  • Highly operational role: deep involvement in keeping production models training smoothly, requiring responsiveness to incidents and flexibility in priorities.
  • Opportunity to work on some of the largest, most sophisticated training runs in industry alongside world-class researchers and engineers.
  • Extended learning through operational responsibility and close collaboration; institutional knowledge built in this role compounds over time.

Logistics & Location

  • Location: This role requires working in-office 5 days per week in San Francisco, CA.
  • Visa sponsorship: Anthropic does sponsor visas and retains an immigration lawyer, though sponsorship is not guaranteed for every role/candidate.
  • Application deadline: None — applications reviewed on a rolling basis.

Compensation & Benefits

  • Expected base annual salary range: $315,000 - $560,000 USD. Total compensation may include equity, benefits, and incentive compensation.
  • Benefits mentioned: competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office space for collaboration.

Other Notes

  • Team expects strong communication and close collaboration; the company emphasizes diversity and encourages applicants who may not meet every qualification to apply.