Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9: expert level. You can teach others and know all the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
A/B Testing @ 3
CI/CD @ 3
Communication @ 3
API @ 3
Experimentation @ 3
LLM @ 3
AI @ 3
Prompt Engineering @ 3
Details
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. The company is building eval frameworks, system prompt pipelines, and regression-detection systems used to measure and ship model and prompt changes with confidence.
About the role
Anthropic is looking for an Engineering Manager to lead the Agent Prompts & Evals team. This team owns the infrastructure that lets Anthropic ship model and prompt changes with confidence — the eval frameworks, system prompt pipelines, and regression-detection systems that every model launch depends on. The team operates at the seam between product engineering and research, partnering with other eval groups, product teams, TPMs, and research PMs. The role combines platform ownership, hands-on partnership during model launches, and collaboration across teams.
Responsibilities
- Lead and grow a team of prompt engineers and platform software engineers
- Own the product-side eval platform: frameworks, dashboards, bulk runners, and CI integrations used to measure model behavior and catch regressions
- Own system prompt infrastructure: versioning, deployment, rollback, and review tooling for prompts running in production across claude.ai, the API, and agentic surfaces
- Be a steady hand through model launches; act as the operational backstop during high-stakes launch periods
- Build durable collaboration with other evals groups: define ownership boundaries, shared roadmaps, and shared infrastructure practices
- Recruit, close, and retain engineers who work at the intersection of product engineering and model behavior
- Shape the team’s investment priorities (frontier eval development, model launch automation, deeper prompt engineering support)
- Push toward measuring hard-to-measure properties (behavioral drift, prompt quality, harness parity), not just easy metrics
Requirements
- 8+ years in software engineering with 3+ years managing engineering teams, including experience leading a platform, infra, or developer-tooling team whose customers were other engineers
- Track record of building tooling and processes that make it easy for other teams to do the right thing
- Comfort managing a team with a mixed charter: platform ownership, service-to-other-teams, and launch-driven operational rhythm
- Technical depth to engage on system design, review pipeline architecture, and be credible in technical debates; ability to read and review code and occasionally build
- Product mindset and willingness to wear multiple hats
- Demonstrated ability to build and maintain peer relationships with partner orgs: negotiate ownership, align roadmaps, and hold ground without being territorial
- Experience recruiting and closing senior ICs in a competitive market
Strong candidates may also have
- Prior exposure to LLM evals, ML experimentation platforms, or model quality work
- Experience with A/B testing infrastructure, feature flagging, or gradual rollout systems
- Background in devtools, CI/CD platforms, or testing infrastructure at scale
- Experience managing teams that sit between larger orgs and turning that position into an asset
- Interest in AI safety and alignment
Compensation
Annual Salary: $320,000 - $405,000 USD
Logistics
- Education: At least a Bachelor's degree in a related field or equivalent experience
- Location-based hybrid policy: staff are expected to be in one of Anthropic’s offices at least 25% of the time
- Visa sponsorship: Anthropic states that it sponsors visas and retains an immigration lawyer to assist
How we work / Culture
Anthropic emphasizes collaborative, large-scale research efforts and values communication skills. The company highlights impact-focused research and frequent research discussions to align on high-impact directions.
Additional notes
Candidates are encouraged to apply even if they do not meet every qualification. Anthropic provides guidance on candidate AI usage in the application process and warns about recruitment scams: legitimate contacts come only from @anthropic.com email addresses.