Senior MLOps Engineer, GenAI Framework
at NVIDIA
Santa Clara, United States
USD 184,000-356,500 per year
Required Skills & Competences
System Administration @ 4, Ansible @ 4, Docker @ 4, Jenkins @ 4, Kubernetes @ 4, Linux @ 4, DevOps @ 4, Python @ 7, R @ 4, GitHub @ 4, GitHub Actions @ 4, CI/CD @ 4, TensorFlow @ 3, Communication @ 4, Jira @ 4, Debugging @ 7, LLM @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4
Details
NVIDIA is seeking a dedicated and motivated senior build and continuous integration (CI/CD) engineer to join the GenAI Frameworks team working on Megatron-LM and NeMo Framework. These open-source, scalable, cloud-native frameworks support researchers and developers building large language models (LLMs), multimodal models, and video generation workflows. They cover the end-to-end model lifecycle (data curation, training, alignment, customization, evaluation, deployment) and provide tooling to optimize performance and user experience.
Responsibilities
- Architect and manage continuous integration pipelines and release processes for Generative AI frameworks and libraries (Megatron-LM and NeMo Framework).
- Design and implement efficient, scalable DevOps solutions to enable frequent high-quality releases while maximizing performance.
- Work with industry-standard tools in hybrid on-premises and cloud environments (Kubernetes, Docker, Slurm, Ansible, GitLab, GitHub Actions, Jenkins, Artifactory, Jira).
- Assist with cluster operations and system administration (servers, team accounts, clusters).
- Automate recurring tasks such as accuracy and performance regression detection to accelerate R&D cycles.
- Develop and advance quality control measures (code analysis, backwards compatibility, regression testing) and best practices.
- Collaborate closely with DL frameworks and libraries teams (CUDA, cuDNN, cuBLAS, PyTorch) and other NVIDIA engineering teams providing software, testing, and release infrastructure.
Requirements
- BS or MS degree in Computer Science, Computer Architecture, or related technical field (or equivalent experience) and 6+ years of industry experience in DevOps and infrastructure engineering.
- Strong system-level programming skills in Python and shell scripting.
- Extensive understanding of build/release systems and CI/CD; experience with solutions like GitLab, GitHub Actions, Jenkins, etc.
- Experience with Linux system administration.
- Proficient with containerization and cluster management technologies such as Docker and Kubernetes.
- Experience with build tools including Make and CMake.
- Strong background in source code management (SCM) solutions such as GitLab, GitHub, Perforce.
- Strong problem-solving and debugging skills; ability to collaborate and influence in a dynamic environment.
- Excellent interpersonal and written communication skills.
Ways to stand out
- Proven track record with GPU-accelerated systems at scale.
- Familiarity with deep learning frameworks such as PyTorch, JAX, or TensorFlow.
- Expertise in cluster and cloud compute technologies (Slurm, Lustre, Kubernetes).
- Experience with software and hardware benchmarking on high-performance computing systems.
Compensation & Other Details
- Base salary range: 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.
- You will also be eligible for equity and benefits.
- Applications for this job will be accepted at least until August 18, 2025.
NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer.