Senior ML Evaluation Engineer - Autonomous Vehicles

at NVIDIA

📍 Santa Clara, United States

USD 184,000-356,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Machine Learning, RAG

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and you know all the pitfalls and tricks.

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Python @ 7, Spark @ 4, Leadership @ 4, AML @ 4, LLM @ 4, PyTorch @ 4, GPU @ 4, AI @ 4, Robotics @ 4, Agentic AI @ 4, LangChain @ 4, Prompt Engineering @ 4

Details

NVIDIA's AV Eval team is building the next generation of driving behavior evaluation, moving beyond hand-crafted rules to learned evaluation using LLMs, VLMs, and agentic workflows. You will define how we measure whether an autonomous vehicle drives well, building systems that bridge ML research and production evaluation. You will ship systems that run at scale on real-world driving data and produce metrics that block or green-light software releases. This role offers high ownership and visibility to NVIDIA AV leadership, and the work on next-gen AV evaluation has direct impact on vehicle safety and shipping decisions.

Responsibilities

Design and build learned evaluation pipelines that assess driving behavior using LLMs, VLMs, and multimodal models.
Develop agentic workflows that chain model inference, retrieval, and structured reasoning to evaluate complex driving scenarios.
Define evaluation-of-evaluation methodology to verify learned evaluators (i.e., how to know the evaluators themselves are correct).
Build golden-set frameworks and calibration loops for learned metrics.
Partner with AML (Alpamayo Logos) teams on model-specific evaluation needs (e.g., chain-of-thought prediction quality, AML regression coverage).
Instrument evaluation systems with robust experiment tracking, A/B comparison tooling, and model versioning.
Contribute to the team's transition from rule-based to learned evaluation: identify metrics and analyzers that are candidates for ML replacement and build alternatives.

Requirements

PhD, MS, or BS (or equivalent experience) in Computer Science, Computer Engineering, or a related technical field, with 4+ years (PhD), 6+ years (MS), or 8+ years (BS) of relevant experience.
Hands-on experience building LLM/VLM-based pipelines — fine-tuning, prompt engineering, retrieval-augmented generation, chain-of-thought.
Track record of shipping ML systems to production (not just prototyping or publishing).
Strong software engineering fundamentals — clean, tested, reviewable code in Python and C++.
Experience with evaluation methodology: precision/recall, inter-rater reliability, calibration, annotation pipelines.
Comfortable with large-scale data processing (Spark, Dask, or similar).
Strong Python skills. Experience with PyTorch or JAX. Comfortable with GPU-based training workflows.

Ways to stand out

Autonomous driving, robotics, or safety-critical domain experience.
Familiarity with driving behavior taxonomies (cut-ins, hard braking events, lane-keeping metrics, scenario-based evaluation).
Experience with video understanding models or multimodal evaluation.
Knowledge of agentic AI frameworks (LangChain, DSPy, CrewAI, or custom).
Track record of influencing technical direction across team boundaries.
Experience with LLM/VLM fine-tuning or application development.

Compensation & Additional Info

Base salary range (USD):
- Level 4: 184,000 - 287,500
- Level 5: 224,000 - 356,500
You will also be eligible for equity and benefits.
Applications accepted until March 29, 2026.
NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.