Senior Scientist, Synthetic Data Generation

at NVIDIA
USD 168,000-304,750 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Statistics @ 4, Machine Learning @ 4, Leadership @ 4, Git @ 4, API @ 7, Technical Leadership @ 4, LLM @ 4, AI @ 4, vLLM @ 7, Data Pipelines @ 4

Details

NVIDIA is at the forefront of the AI revolution, and our research is shaping the future of large language models. We are looking for a Senior Scientist to join our team and help advance our capabilities in synthetic data generation for training frontier models. You will contribute to open-source libraries within the NVIDIA NeMo ecosystem that generate synthetic datasets across text, code, structured, and multimodal data, directly feeding the pre- and post-training of LLMs such as Nemotron. This role combines hands-on software engineering with applied research in generative methods, and you will collaborate with research, engineering, product, and model teams as well as external labs.

Responsibilities

  • Build synthetic data generation pipelines using LLM-based methods and automated quality evaluation, producing datasets that improve the pre- and post-training of LLMs such as Nemotron across reasoning, coding, structured output, and multimodal understanding.
  • Advance multimodal synthetic data generation (image, document, video, and audio) in partnership with NVIDIA's model teams.
  • Design and maintain open-source libraries and SDKs with clean APIs and strong documentation.
  • Drive software excellence with modern tooling, configuration-driven architecture, and professional Git and CI/CD practices.
  • Publish original research at top machine learning and AI conferences to maintain NVIDIA's technical leadership.
  • Mentor interns and junior researchers to foster technical growth within the team.

Requirements

  • PhD in Computer Science, Machine Learning, Statistics, or a related field, or equivalent experience.
  • Research background of 3+ years in synthetic data generation, generative modeling, multimodal machine learning, or related areas (comparable experience considered).
  • Deep technical understanding of LLMs, how data shapes their pre- and post-training, and inference frameworks such as vLLM or TGI.
  • Proven track record of developing or maintaining software libraries used by a broad developer community.
  • Strong publication record at premier venues such as NeurIPS, ICML, ICLR, ACL or similar.

Ways to stand out from the crowd

  • Open-source contributions in ML or data tooling.
  • Experience with multimodal generation or understanding (vision-language, document AI, video, or audio).
  • Experience building and optimizing scalable data pipelines for large-scale model training (throughput, distributed inference).
  • Experience generating data for agentic, tool-use, or reinforcement-learning post-training.

Compensation and benefits

  • Base salary range (Level 3): 168,000 USD - 264,500 USD.
  • Base salary range (Level 4): 192,000 USD - 304,750 USD.
  • Eligible for equity and benefits (link to NVIDIA benefits provided in the original posting).

Other information

  • Applications for this job will be accepted at least until June 14, 2026.
  • This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is committed to fostering an inclusive work environment and is an equal opportunity employer. The company does not discriminate on the basis of protected characteristics.