Data Scientist, Evals
USD 210,000-385,000 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level value.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Python @ 6
SQL @ 6
Machine Learning @ 5
Data Science @ 5
Leadership @ 3
AWS @ 3
Technical Leadership @ 3
Databricks @ 3
LLM @ 3
AI @ 3
Details
Perplexity serves tens of millions of users daily with reliable, high-quality answers grounded in an LLM-first search engine and specialized data sources. This role focuses on building specialized evaluations (evals) to improve answer quality across Perplexity, covering search-based LLM answers and other user-facing scenarios.
Responsibilities
- Architect and maintain automated evaluation pipelines to assess answer quality across Perplexity's products, ensuring high standards for accuracy and helpfulness.
- Design evaluation sets and methods specifically to measure the impact of tool calls (particularly web search retrieval) on the final answer's quality.
- Develop VLM-based solutions to programmatically evaluate how final answers render visually across different platforms and devices.
- Continuously review public benchmarks and academic evaluations for their applicability to the Perplexity product, adapting and incorporating them into regular performance measurements.
- Operate within a small, high-impact team where evaluation metrics directly shape product changes, collaborating closely with technical leadership to measure and improve answer quality.
Requirements
- PhD or MS in a technical field, or equivalent experience.
- 4+ years of experience in data science or machine learning.
- Strong proficiency in Python and SQL (expected to write production-grade code).
- Experience building within a modern cloud data stack, specifically AWS and Databricks.
- Comfortable with agentic coding workflows and using AI-assisted development tools to iterate faster.
Preferred Qualifications
- 1+ years of experience working with LLMs at scale, specifically with LLM-as-a-judge setups.
- Prior experience working on customer-facing web products or consumer apps with real user traffic at scale.
- A strong research background, with experience applying research methods to real-world ML problems.
- Experience defining evaluation metrics (e.g., factual consistency, hallucination rate, retrieval precision) and building ground truth datasets.
Benefits
- U.S. Benefits: Full-time U.S. employees enjoy a comprehensive benefits program including equity, health, dental, vision, retirement, fitness, commuter and dependent care accounts, and more.
- International Benefits: Full-time employees outside the U.S. enjoy a comprehensive benefits program tailored to their region of residence.
- Note: USD salary ranges apply only to U.S.-based positions. International salaries are set based on the local market; final offers are determined by multiple factors, including experience and expertise.