Senior Scientist, Synthetic Data Generation

at NVIDIA

📍 Santa Clara, United States

USD 168,000-304,750 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.

Statistics @ 4, Machine Learning @ 4, Leadership @ 4, Git @ 4, API @ 7, Technical Leadership @ 4, LLM @ 4, AI @ 4, vLLM @ 7, Data Pipelines @ 4

Details

NVIDIA is at the forefront of the AI revolution, and our research is shaping the future of large language models. We are looking for a Senior Scientist to join our team and help advance our capabilities in synthetic data generation for training frontier models. You will contribute to open-source libraries within the NVIDIA NeMo ecosystem that generate synthetic datasets across text, code, structured, and multimodal data, directly feeding the pre- and post-training of LLMs such as Nemotron. This role combines hands-on software engineering with applied research in generative methods, and you will collaborate with research, engineering, product, and model teams as well as external labs.

Responsibilities

Build synthetic data generation pipelines using LLM-based methods and automated quality evaluation, producing datasets that improve the pre- and post-training of LLMs such as Nemotron in reasoning, coding, structured output, and multimodal understanding.
Advance multimodal synthetic data generation — image, document, video, and audio — in partnership with NVIDIA's model teams.
Design and maintain open-source libraries and SDKs with clean APIs and strong documentation.
Drive software excellence with modern tooling, configuration-driven architecture, and professional Git and CI/CD practices.
Publish original research at top machine learning and AI conferences to maintain NVIDIA's technical leadership.
Mentor interns and junior researchers to foster technical growth within the team.

Requirements

PhD in Computer Science, Machine Learning, Statistics, or a related field, or equivalent experience.
Research background of 3+ years in synthetic data generation, generative modeling, multimodal machine learning, or related areas (comparable experience considered).
Deep technical understanding of LLMs, how data shapes their pre- and post-training, and inference frameworks such as vLLM or TGI.
Proven track record of developing or maintaining software libraries used by a broad developer community.
Strong publication record at premier venues such as NeurIPS, ICML, ICLR, ACL, or similar.

Ways to stand out from the crowd

Open-source contributions in ML or data tooling.
Experience with multimodal generation or understanding (vision-language, document AI, video, or audio).
Experience building and optimizing scalable data pipelines for large-scale model training (throughput, distributed inference).
Experience generating data for agentic, tool-use, or reinforcement-learning post-training.

Compensation and benefits

Base salary range (Level 3): 168,000 USD - 264,500 USD.
Base salary range (Level 4): 192,000 USD - 304,750 USD.
Eligible for equity and benefits (link to NVIDIA benefits provided in the original posting).

Other information

Applications for this job will be accepted at least until June 14, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering an inclusive work environment and is an equal opportunity employer. The company does not discriminate on the basis of protected characteristics.