Engineering Manager, Agent Prompts & Evals

at Anthropic

📍 New York City, United States
📍 San Francisco, United States

USD 320,000-405,000 per year

MIDDLE

✅ Hybrid

✅ Visa Sponsorship

Used Tools & Technologies

Machine Learning

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and you know the pitfalls and tricks.

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

A/B Testing @ 3, CI/CD @ 3, Communication @ 3, API @ 3, Experimentation @ 3, LLM @ 3, AI @ 3, Prompt Engineering @ 3

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The company is building eval frameworks, system prompt pipelines, and regression-detection systems used to measure and ship model and prompt changes with confidence.

About the role

Anthropic is looking for an Engineering Manager to lead the Agent Prompts & Evals team. This team owns the infrastructure that lets Anthropic ship model and prompt changes with confidence — the eval frameworks, system prompt pipelines, and regression-detection systems that every model launch depends on. The team operates at the seam between product engineering and research, partnering with other eval groups, product teams, TPMs, and research PMs. The role combines platform ownership, hands-on partnership during model launches, and collaboration across teams.

Responsibilities

Lead and grow a team of prompt engineers and platform software engineers
Own the product-side eval platform: frameworks, dashboards, bulk runners, and CI integrations used to measure model behavior and catch regressions
Own system prompt infrastructure: versioning, deployment, rollback, and review tooling for prompts running in production across claude.ai, the API, and agentic surfaces
Be a steady hand through model launches; act as the operational backstop during high-stakes launch periods
Build durable collaboration with other eval groups: define ownership boundaries, shared roadmaps, and shared infrastructure practices
Recruit, close, and retain engineers who work at the intersection of product engineering and model behavior
Shape the team’s investment priorities (frontier eval development, model launch automation, deeper prompt engineering support)
Push toward measuring hard-to-measure properties (behavioral drift, prompt quality, harness parity) rather than only the easy metrics

Requirements

8+ years in software engineering with 3+ years managing engineering teams, including experience leading a platform, infra, or developer-tooling team whose customers were other engineers
Track record of building tooling and processes that make it easy for other teams to do the right thing
Comfort managing a team with a mixed charter: platform ownership, service-to-other-teams, and launch-driven operational rhythm
Technical depth to engage on system design, review pipeline architecture, and be credible in technical debates; ability to read and review code and occasionally build
Product mindset and willingness to wear multiple hats
Demonstrated ability to build and maintain peer relationships with partner orgs: negotiate ownership, align roadmaps, and hold ground without being territorial
Experience recruiting and closing senior ICs in a competitive market

Strong candidates may also have

Prior exposure to LLM evals, ML experimentation platforms, or model quality work
Experience with A/B testing infrastructure, feature flagging, or gradual rollout systems
Background in devtools, CI/CD platforms, or testing infrastructure at scale
Experience managing teams that sit between larger orgs and turning that position into an asset
Interest in AI safety and alignment

Compensation

Annual Salary: $320,000 - $405,000 USD

Logistics

Education: At least a Bachelor's degree in a related field or equivalent experience
Location-based hybrid policy: staff are expected to be in one of Anthropic’s offices at least 25% of the time
Visa sponsorship: Anthropic states that it sponsors visas and retains an immigration lawyer to assist

How we work / Culture

Anthropic emphasizes collaborative, large-scale research efforts and values communication skills. The company highlights impact-focused research and frequent research discussions to align on high-impact directions.

Additional notes

Candidates are encouraged to apply even if they do not meet every qualification. Anthropic provides guidance on candidate AI usage in the application process and warns about recruitment scams (legitimate contacts come from @anthropic.com email addresses).