Senior Prompt and Benchmark Engineer, Evaluation of World Models

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Machine Learning @ 8, Communication @ 4, QA @ 4, NLP @ 8

Details

Work on Cosmos' generative AI evaluation team to design domain-specific benchmarks and prompt-based evaluations for world foundation models. This role focuses entirely on evaluation: designing prompts, question banks, test suites, and evaluation workflows for multimodal and agentic systems (video, simulation, robotics/AV, and physical environments). You will collaborate closely with researchers, annotators, and domain experts to translate evaluation needs into scalable, structured test cases. You will not be responsible for building production pipelines or infrastructure; those will be implemented by supporting engineers.

Responsibilities

  • Develop detailed, domain-specific benchmarks for world foundation models, with an emphasis on generation and understanding of video, simulation, and physical environments.
  • Use sophisticated prompt engineering techniques to elicit structured, interpretable responses from a variety of foundation models.
  • Build, refine, and maintain question banks, multiple-choice formats, and test suites for automated and human evaluation workflows.
  • Employ multiple VLMs/LLMs in parallel to explore ensemble evaluation methods such as majority voting, ranking agreement, and answer consensus.
  • Encode prompts and expected outputs into structured formats to enable automated, scalable evaluation pipelines.
  • Interface directly with Cosmos researchers to translate evaluation needs into scalable test cases and standardized metrics.
  • Collaborate with human annotators: create clear task instructions, feedback loops, and quality-control mechanisms to ensure dataset reliability.
  • Meet with domain experts in robotics, autonomous vehicles, and simulation to derive transferable metrics and co-develop standardized evaluation formats.
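
For illustration only, two of the ideas in the responsibilities above (encoding prompts and expected outputs in a structured format, and aggregating several models' answers by majority voting) could look like the following minimal Python sketch. The schema, field names, model names, and example question are hypothetical and do not describe NVIDIA's or Cosmos' actual tooling.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class MultipleChoiceItem:
    """One structured test case (hypothetical schema, assumed for this sketch)."""
    item_id: str
    prompt: str
    choices: Tuple[str, ...]  # option labels, e.g. ("A", "B", "C")
    expected: str             # label of the correct option


def majority_vote(answers: List[str]) -> Tuple[Optional[str], float]:
    """Return (winning label, agreement ratio).

    The label is None when there are no answers or the top count is tied.
    """
    if not answers:
        return None, 0.0
    counts = Counter(answers).most_common()
    top_label, top_count = counts[0]
    agreement = top_count / len(answers)
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, agreement  # tie: no consensus
    return top_label, agreement


def score_item(item: MultipleChoiceItem, answers_by_model: Dict[str, str]) -> dict:
    """Aggregate per-model answers for one item and compare with the expected label."""
    # Drop answers that are not one of the item's option labels.
    valid = [a for a in answers_by_model.values() if a in item.choices]
    consensus, agreement = majority_vote(valid)
    return {
        "item_id": item.item_id,
        "consensus": consensus,
        "agreement": agreement,  # share of valid answers matching the top label
        "correct": consensus == item.expected,
    }


# Hypothetical item and hypothetical model answers.
item = MultipleChoiceItem(
    item_id="phys-001",
    prompt="After the ball rolls off the table, which way does it move? (A) down (B) up (C) it hovers",
    choices=("A", "B", "C"),
    expected="A",
)
result = score_item(item, {"model_1": "A", "model_2": "A", "model_3": "C"})
print(result)
```

In this sketch a tie yields no consensus rather than an arbitrary winner, so tied items can be routed to human annotators instead of being scored automatically.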

Requirements

  • Demonstrated experience with prompt engineering: crafting, refining, and optimizing prompts for evaluation and task elicitation.
  • Strong attention to detail in designing natural-language questions and formatting structured evaluations (question banks, multiple choice, test suites).
  • Proven ability to reason about model capabilities, failure modes, and blind spots in real-world generative model deployments.
  • Experience contributing to or designing benchmarks and evaluation datasets, especially for multimodal or agentic systems.
  • Familiarity with evaluating models via prompting, capturing structured outputs, and comparing performance across model families.
  • Working understanding of VLMs and foundation models at inference time, including token-level outputs, autoregressive decoding, and model context windows.
  • Experience collaborating with annotators and building QA processes for large-scale labeling campaigns.
  • Excellent communication and collaboration skills; regular interaction with researchers, annotators, and downstream users is expected.
  • 10+ years of experience in Machine Learning, NLP, Human-Computer Interaction, or related fields.
  • BS, MS, or equivalent background. Prior experience in AI evaluation, annotation workflows, or research is highly valued.

Ways to Stand Out

  • Hands-on experience with multiple LLMs/VLMs (e.g., GPT, Claude, Gemini, Flamingo, Kosmos, IDEFICS) for output comparison and prompt engineering.
  • Prior work designing benchmarks for robotics, simulation, autonomous vehicles, or agentic tasks, especially multimodal/video-based evaluations.
  • Demonstrated experience using VLMs/LLMs as evaluators (response scoring, ranking, consensus aggregation).
  • Deep curiosity about model behavior and a drive to test, interrogate, and stretch the limits of generative systems.

Compensation & Logistics

  • This is a full-time role.
  • Note: This is a 100% evaluation-focused role; you will not be required to build pipelines or infrastructure.
  • Base salary ranges (location- and level-dependent):
    • Level 4: 184,000 USD - 287,500 USD
    • Level 5: 224,000 USD - 356,500 USD
  • You will also be eligible for equity and benefits.
  • Applications are accepted at least until November 22, 2025.

Equal Opportunity

NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. NVIDIA does not discriminate on the basis of protected characteristics and values diversity in employees and applicants.