Senior MLOps Engineer, GenAI Framework

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ Hybrid

Required Skills & Competences

System Administration @ 4, Ansible @ 4, Docker @ 4, Jenkins @ 4, Kubernetes @ 4, Linux @ 4, DevOps @ 4, Python @ 7, R @ 4, GitHub @ 4, GitHub Actions @ 4, CI/CD @ 4, TensorFlow @ 3, Communication @ 4, Jira @ 4, Debugging @ 7, LLM @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4

Details

NVIDIA is seeking a dedicated and motivated senior build and continuous integration (CI/CD) engineer to join the GenAI Frameworks team working on Megatron-LM and NeMo Framework. These open-source, scalable, cloud-native frameworks support researchers and developers building large language models (LLMs), multimodal models, and video generation workflows. They provide end-to-end model training (data curation, alignment, customization, evaluation, deployment) and tooling to optimize performance and user experience.

Responsibilities

  • Architect and manage continuous integration pipelines and release processes for Generative AI frameworks and libraries (Megatron-LM and NeMo Framework).
  • Design and implement efficient, scalable DevOps solutions to enable frequent high-quality releases while maximizing performance.
  • Work with industry-standard tools in hybrid on-premise and cloud environments (Kubernetes, Docker, Slurm, Ansible, GitLab, GitHub Actions, Jenkins, Artifactory, Jira).
  • Assist with cluster operations and system administration (servers, team accounts, clusters).
  • Automate recurring tasks such as accuracy and performance regression detection to accelerate R&D cycles.
  • Develop and advance quality control measures (code analysis, backwards compatibility, regression testing) and best practices.
  • Collaborate closely with DL frameworks and libraries teams (CUDA, cuDNN, cuBLAS, PyTorch) and other NVIDIA engineering teams providing software, testing, and release infrastructure.

Requirements

  • BS or MS degree in Computer Science, Computer Architecture, or related technical field (or equivalent experience) and 6+ years of industry experience in DevOps and infrastructure engineering.
  • Strong system-level programming skills in Python and shell scripting.
  • Extensive understanding of build/release systems and CI/CD; experience with solutions like GitLab, GitHub Actions, Jenkins, etc.
  • Experience with Linux system administration.
  • Proficient with containerization and cluster management technologies such as Docker and Kubernetes.
  • Experience with build tools including Make and CMake.
  • Strong background in source code management (SCM) solutions such as GitLab, GitHub, Perforce.
  • Strong problem-solving and debugging skills; ability to collaborate and influence in a dynamic environment.
  • Excellent interpersonal and written communication skills.

Ways to stand out

  • Proven track record with GPU-accelerated systems at scale.
  • Familiarity with deep learning frameworks such as PyTorch, JAX, or TensorFlow.
  • Expertise in cluster and cloud compute technologies (Slurm, Lustre, Kubernetes).
  • Experience with software and hardware benchmarking on high-performance computing systems.

Compensation & Other Details

  • Base salary range: 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.
  • You will also be eligible for equity and benefits.
  • Applications for this job will be accepted at least until August 18, 2025.

NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer.