Research Engineer, Safeguards Labs

USD 350,000-850,000 per year
MIDDLE
✅ Hybrid
✅ Visa Sponsorship

Used Tools & Technologies

LLM

Required Skills & Competences

Security @ 3, Python @ 5, Machine Learning @ 3, Hiring @ 3, Scoping @ 3, Communication @ 3, Fraud @ 3, AI @ 3

Details

Anthropic is hiring Research Engineers for Safeguards Labs, a team that develops and prototypes novel safety methods to protect Claude and its users. The role combines research, engineering, and applied analysis to detect misuse, build safeguards, and move promising prototypes toward production.

Responsibilities

  • Lead and contribute to research projects investigating methods for detecting misuse of Claude, identifying malicious organizations and accounts, strengthening model safeguards, and addressing other safety needs.
  • Design and run offline analyses over model usage data to surface abuse patterns, build classifiers and detection systems, and evaluate their effectiveness.
  • Develop and iterate on prototypes that could eventually feed signals into real-time safeguards, partnering with engineers on tech transfer.
  • Contribute to research on detecting abusive behavior in chat-based or agentic workflows, and on training models to robustly refrain from dangerous responses without over-refusing.
  • Build evaluations and methodologies for measuring whether safeguards work, including in agentic settings.
  • Write up findings clearly to inform Trust & Safety, research, and product teams.

Requirements

  • Track record of independently driving research projects from ambiguous problems to concrete results, ideally in AI, ML, security, integrity, or related technical fields.
  • Comfortable scoping work and switching between research, engineering, and analysis as needed.
  • Working familiarity with how large language models operate (sampling, prompting, training).
  • Proficient in Python and comfortable working with large datasets.
  • Care about the societal impacts of AI and want your work to directly reduce real-world harm.
  • Minimum education: Bachelor’s degree or equivalent combination of education, training, and/or experience.
  • Minimum years of experience: not explicitly specified; tied to internal job-level requirements.

Strong candidates may also have

  • Experience building and training machine learning models, including classifiers for abuse, fraud, integrity, or security applications.
  • Knowledge of evaluation methodologies for language models and experience designing evals.
  • Experience with agentic environments and evaluating model behavior in them.
  • Background in trust and safety, integrity, fraud detection, threat intelligence, or adversarial ML.
  • Experience with red teaming, jailbreak research, or interpretability methods like steering vectors.
  • History of taking research prototypes and transferring them into production systems.

Compensation

  • Annual Salary: $350,000 - $850,000 USD

Logistics & Workplace Policy

  • Locations: San Francisco, CA and New York City, NY (United States).
  • Location-based hybrid policy: staff are expected to be in one of the offices at least 25% of the time; some roles may require more office time.
  • Visa sponsorship: Anthropic states that it sponsors visas and retains an immigration lawyer to assist, though sponsorship is not guaranteed for every role or candidate.

About the Team

Safeguards Labs operates at the intersection of research and engineering to prototype safety approaches for model behavior and usage safeguards, and to pressure-test ideas through offline analysis and trials on subsets of traffic before production handoff.

About Anthropic

Anthropic's mission is to create reliable, interpretable, and steerable AI systems. The company values collaboration, communication, and impact-focused large-scale research. The posting mentions competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and office collaboration spaces.