Machine Learning Engineer, LLM Evals & Observability

at Glean

📍 Mountain View, United States

USD 200,000-300,000 per year

MIDDLE

✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level value. About proficiency levels:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Go @ 6, Python @ 6, Machine Learning @ 3, NLP @ 3, LLM @ 3, Observability @ 3, AI @ 3, Reinforcement Learning @ 3, Data Pipelines @ 3

Details

Glean is the Work AI platform that helps everyone work smarter with AI. What began as the industry’s most advanced enterprise search has evolved into a full-scale Work AI ecosystem powering intelligent Search, an AI Assistant, and scalable AI agents on one secure, open platform. Glean’s Enterprise Graph and Personal Knowledge Graph map relationships between people, content, and activity to deliver personalized, context-aware responses. The company powers agentic capabilities that automate real work across teams by accessing enterprise and external data.

Responsibilities

Design and curate evaluation datasets — sampling strategies, query diversity, and golden sets that provide representative coverage of real assistant behavior.
Build and maintain large-scale evaluation pipelines that measure assistant quality across thousands of real user queries.
Build LLM-powered judges that score correctness, completeness, and response quality, and align automated scores with human judgment.
Evaluate new models and product changes before they ship, providing quality signals that gate launches and prevent regressions.
Build observability infrastructure for AI agents: trace enrichment, data pipelines, and dashboards that make assistant behavior inspectable.
Close the loop between quality measurement and improvement: use eval results, customer feedback, and techniques such as automated prompt iteration to improve assistant behavior.
Collaborate with engineers across the company to make evals a first-class part of how the product is shipped.

Requirements

2+ years of software engineering experience with strong coding skills.
Strong backend fundamentals in Go and Python; comfortable with distributed data pipelines.
Experience working with LLM evaluation, reinforcement learning from human feedback (RLHF), natural language processing (NLP), or other large systems involving machine learning.
Analytically rigorous — ability to reason about what offline metrics predict about real user experience.
Ability to thrive in a customer-focused, cross-functional environment and to prioritize work that is most impactful for the company.
Deep care for quality, both in the systems you build and in the product being measured and improved.

Location

This role is hybrid (3–4 days a week in one of our San Francisco Bay Area offices). Location listed: Mountain View, California.

Compensation & Benefits

Base salary range: $200,000 - $300,000 annually. Compensation will be determined by factors such as location, level, knowledge, skills, and experience. Certain roles may be eligible for variable compensation, equity, and benefits.
Benefits include medical, vision, and dental coverage, a generous time-off policy, the opportunity to contribute to a 401(k) plan, a home office improvement stipend, and annual education and wellness stipends.
Additional perks: regular company events and daily healthy lunches.

AI-First Mindset

As part of the interview process, candidates will complete a brief AI-focused exercise or discussion to demonstrate how they think about, design, and use AI to drive impact in the role.