Senior Research Engineer, Foundation Model Training Infrastructure
at NVIDIA
USD 224,000-356,500 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Kubernetes @ 3
Python @ 7
MLOps @ 8
TensorFlow @ 4
Debugging @ 4
LLM @ 7
PyTorch @ 4
CUDA @ 7
GPU @ 4
Details
NVIDIA’s Generalist Embodied Agent Research (GEAR) group is searching for a senior or principal engineer who specializes in building cutting-edge infrastructure for large-scale foundation model training. The team leads Project GR00T, NVIDIA’s initiative to build foundation models and full-stack technology for humanoid robots. You will work with a collaborative research team producing influential work on multimodal foundation models, large-scale robot learning, embodied AI, and physics simulation. The team's projects include Eureka, VIMA, Voyager, MineDojo, MimicPlay, and Prismer, among others.
Responsibilities
- Design and maintain large-scale distributed training systems to support multimodal foundation models for robotics.
- Optimize GPU and cluster utilization for efficient model training and fine-tuning on massive datasets.
- Implement scalable data loaders and preprocessors tailored for multimodal datasets, including videos, text, and sensor data.
- Develop robust monitoring and debugging tools to ensure the reliability and performance of training workflows on large GPU clusters.
- Collaborate with researchers to integrate cutting-edge model architectures into scalable training pipelines.
Requirements
- Bachelor’s degree in Computer Science, Robotics, Engineering, or a related field.
- 10+ years of full-time industry experience in large-scale MLOps and AI infrastructure.
- Proven experience designing and optimizing distributed training systems using frameworks such as PyTorch, JAX, or TensorFlow.
- Deep understanding of GPU acceleration and CUDA programming.
- Experience with cluster management and orchestration tools such as Kubernetes; familiarity with job schedulers like SLURM is expected.
- Strong programming skills in Python and a high-performance language such as C++ for efficient system development.
- Strong experience with large-scale GPU clusters, HPC environments, and job scheduling/orchestration tools.
Preferred / Ways to Stand Out
- Master’s or PhD in Computer Science, Robotics, Engineering, or a related field.
- Demonstrated Tech Lead experience coordinating teams and driving projects from conception to deployment.
- Strong experience building large-scale LLM and multimodal LLM training infrastructure.
- Contributions to popular open-source AI frameworks or research publications in top-tier AI conferences (NeurIPS, ICRA, ICLR, CoRL).
Compensation & Benefits
- Base salary range: 224,000 USD - 356,500 USD (determined based on location, experience, and pay of employees in similar positions).
- Eligible for equity and additional benefits (see NVIDIA benefits pages).
Additional Information
- Location: Santa Clara, California, United States.
- Employment type: Full time.
- Applications for this job will be accepted at least until July 29, 2025.
- NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.