Senior Research Engineer, Foundation Model Training Infrastructure

at NVIDIA
USD 224,000-356,500 per year
SENIOR
✅ On-site

Required Skills & Competences

Kubernetes @ 4, Python @ 7, MLOps @ 8, TensorFlow @ 4, Hiring @ 4, Debugging @ 4, LLM @ 7, PyTorch @ 4, CUDA @ 7, GPU @ 4

Details

NVIDIA is hiring a senior or principal engineer to build cutting-edge infrastructure for large-scale foundation model training in the Generalist Embodied Agent Research (GEAR) group. The team leads Project GR00T, NVIDIA’s initiative to build foundation models and full-stack technology for humanoid robots. You will collaborate with researchers working on multimodal foundation models, large-scale robot learning, embodied AI, and physics simulation, contributing to impactful research projects and product roadmaps.

Responsibilities

  • Design and maintain large-scale distributed training systems to support multimodal foundation models for robotics.
  • Optimize GPU and cluster utilization for efficient model training and fine-tuning on massive datasets.
  • Implement scalable data loaders and preprocessors tailored for multimodal datasets (videos, text, sensor data).
  • Develop robust monitoring and debugging tools to ensure reliability and performance of training workflows on large GPU clusters.
  • Collaborate with researchers to integrate cutting-edge model architectures into scalable training pipelines.

Requirements

  • Bachelor’s degree in Computer Science, Robotics, Engineering, or a related field.
  • 10+ years of full-time industry experience in large-scale MLOps and AI infrastructure.
  • Proven experience designing and optimizing distributed training systems with frameworks such as PyTorch, JAX, or TensorFlow.
  • Deep understanding of GPU acceleration and CUDA programming.
  • Experience with cluster management tools like Kubernetes.
  • Strong programming skills in Python and a high-performance language such as C++.
  • Strong experience with large-scale GPU clusters, HPC environments, and job scheduling/orchestration tools (e.g., SLURM, Kubernetes).

Ways to stand out (Preferred)

  • Master’s or PhD in Computer Science, Robotics, Engineering, or a related field.
  • Demonstrated experience as a technical lead, coordinating engineering teams and driving projects from conception to deployment.
  • Strong experience building large-scale LLM and multimodal LLM training infrastructure.
  • Contributions to popular open-source AI frameworks or publications in top-tier AI conferences (NeurIPS, ICRA, ICLR, CoRL).

Compensation & Benefits

  • Base salary range: 224,000 USD - 356,500 USD (base salary determined by location, experience, and internal pay bands).
  • Eligible for equity and company benefits.

Additional Information

  • Applications accepted at least until July 29, 2025.
  • NVIDIA is an equal opportunity employer committed to diversity and inclusion.