Senior MLOps Engineer, GenAI Framework

at NVIDIA
USD 152,000-241,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

System Administration @ 4, Ansible @ 4, Docker @ 4, Jenkins @ 4, Kubernetes @ 4, Linux @ 4, DevOps @ 4, Python @ 7, GitHub @ 4, GitHub Actions @ 4, CI/CD @ 4, TensorFlow @ 3, Hiring @ 4, Communication @ 4, Jira @ 4, Debugging @ 7, LLM @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4, Deep Learning @ 4, AI @ 4, GenAI @ 4, Slurm @ 4

Details

NVIDIA is hiring a build and continuous integration (CI/CD) engineer to join the GenAI Frameworks team (Megatron-LM and NeMo Framework). These are open-source, scalable, cloud-native frameworks for large language model (LLM), multimodal, and video generation workloads. The role focuses on enabling framework engineers, deep learning algorithm engineers, and research scientists to deliver high-quality, high-performance software by developing and maintaining CI/CD, build/release processes, automation, and cluster/infrastructure tooling.

Responsibilities

  • Develop and maintain continuous integration pipelines and release processes for Megatron-LM and NeMo Framework.
  • Implement efficient, scalable DevOps solutions to enable more frequent, high-quality releases while maintaining performance.
  • Work with industry-standard tools in hybrid on-premises and cloud environments: Kubernetes, Docker, Slurm, Ansible, GitLab, GitHub Actions, Jenkins, Artifactory, Jira.
  • Assist with cluster operations and system administration (managing servers, team accounts, clusters).
  • Automate recurring tasks to accelerate R&D cycles, such as accuracy and performance regression detection.
  • Develop quality control measures (code analysis, backwards compatibility, regression testing) and advance best practices.
  • Collaborate closely with teams working on DL frameworks and libraries (CUDA, cuDNN, cuBLAS, PyTorch) and other NVIDIA engineering teams providing software, testing, and release infrastructure.

Requirements

  • BS or MS in Computer Science, Computer Architecture, or a related technical field (or equivalent experience).
  • 3+ years of industry experience in DevOps and infrastructure engineering.
  • Strong system-level programming skills in Python and shell scripting.
  • Experience with build/release systems and CI/CD (GitLab, GitHub, Jenkins, etc.).
  • Experience with Linux system administration.
  • Experience with containerization and cluster management (Docker, Kubernetes).
  • Experience with build tools including Make and CMake.
  • Strong background in source code management (GitLab, GitHub, Perforce, etc.).
  • Strong problem-solving and debugging skills.
  • Good collaboration, interpersonal, and written communication skills.

Ways to stand out

  • Proven track record with GPU-accelerated systems at scale.
  • Familiarity with deep learning frameworks such as PyTorch, JAX, or TensorFlow.
  • Expertise in cluster and cloud compute technologies (e.g., Slurm, Lustre, Kubernetes).
  • Experience in software and hardware benchmarking on high-performance computing systems.

Compensation and benefits

  • Base salary range: 152,000 USD - 241,500 USD (determined based on location, experience, and pay of employees in similar positions).
  • Eligible for equity and company benefits (link to NVIDIA benefits referenced in posting).

Additional information

  • Applications will be accepted at least until February 23, 2026.
  • This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer and states that it does not discriminate on the basis of protected characteristics.