Senior MLOps Engineer, GenAI Framework

at NVIDIA

📍 Santa Clara, United States

USD 184,000-356,500 per year

SENIOR

✅ Hybrid

Required Skills & Competences

System Administration @ 4, Ansible @ 4, Docker @ 4, Jenkins @ 4, Kubernetes @ 4, Linux @ 4, DevOps @ 4, Python @ 7, R @ 4, GitHub @ 4, GitHub Actions @ 4, CI/CD @ 4, TensorFlow @ 3, Communication @ 4, Jira @ 4, Debugging @ 7, LLM @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4

Details

NVIDIA is seeking a dedicated and motivated senior build and continuous integration (CI/CD) engineer to join the GenAI Frameworks team working on Megatron-LM and NeMo Framework. These open-source, scalable, cloud-native frameworks support researchers and developers building large language models (LLMs), multimodal models, and video generation workflows. They provide end-to-end model training (data curation, alignment, customization, evaluation, deployment) and tooling to optimize performance and user experience.

Responsibilities

Architect and manage continuous integration pipelines and release processes for Generative AI frameworks and libraries (Megatron-LM and NeMo Framework).
Design and implement efficient, scalable DevOps solutions to enable frequent high-quality releases while maximizing performance.
Work with industry-standard tools in hybrid on-premises and cloud environments (Kubernetes, Docker, Slurm, Ansible, GitLab, GitHub Actions, Jenkins, Artifactory, Jira).
Assist with cluster operations and system administration (servers, team accounts, clusters).
Automate recurring tasks such as accuracy and performance regression detection to accelerate R&D cycles.
Develop and advance quality control measures (code analysis, backwards compatibility, regression testing) and best practices.
Collaborate closely with DL frameworks and libraries teams (CUDA, cuDNN, cuBLAS, PyTorch) and other NVIDIA engineering teams providing software, testing, and release infrastructure.

Requirements

BS or MS degree in Computer Science, Computer Architecture, or related technical field (or equivalent experience) and 6+ years of industry experience in DevOps and infrastructure engineering.
Strong system-level programming skills in Python and shell scripting.
Extensive understanding of build/release systems and CI/CD; experience with solutions like GitLab, GitHub Actions, Jenkins, etc.
Experience with Linux system administration.
Proficiency with containerization and cluster management technologies such as Docker and Kubernetes.
Experience with build tools including Make and CMake.
Strong background in source code management (SCM) solutions such as GitLab, GitHub, Perforce.
Strong problem-solving and debugging skills; ability to collaborate and influence in a dynamic environment.
Excellent interpersonal and written communication skills.

Ways to stand out

Proven track record with GPU-accelerated systems at scale.
Familiarity with deep learning frameworks such as PyTorch, JAX, or TensorFlow.
Expertise in cluster and cloud compute technologies (Slurm, Lustre, Kubernetes).
Experience with software and hardware benchmarking on high-performance computing systems.

Compensation & Other Details

Base salary range: 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.
You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until August 18, 2025.

NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer.