Used Tools & Technologies
Not specified
Required Skills & Competences
System Administration @ 4, Ansible @ 4, Docker @ 4, Jenkins @ 4, Kubernetes @ 4, Linux @ 4, DevOps @ 4, Python @ 7, R @ 4, GitHub @ 4, GitHub Actions @ 4, CI/CD @ 4, MLOps @ 4, TensorFlow @ 3, Communication @ 7, Jira @ 4, Debugging @ 7, LLM @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4
Details
NVIDIA seeks a Senior MLOps Engineer for the GenAI Frameworks team (Megatron-LM and NeMo Framework). The role focuses on architecting and managing build and CI/CD pipelines for open-source, cloud-native frameworks used in large language model (LLM), multimodal, and video generation workflows. The position works across the full deep learning software stack to enable framework engineers, algorithm engineers, and research scientists to develop, optimize, test, and release high-performance software.
Responsibilities
- Architect and manage continuous integration pipelines and release processes for Megatron-LM and NeMo Framework and associated libraries.
- Design and implement efficient, scalable DevOps solutions to enable frequent, high-quality releases while maintaining performance.
- Work with industry-standard tools and platforms including Kubernetes, Docker, Slurm, Ansible, GitLab, GitHub Actions, Jenkins, Artifactory, and Jira in hybrid on-premises and cloud environments.
- Assist with cluster operations and system administration (servers, team accounts, clusters).
- Automate recurring tasks to accelerate R&D cycles, such as accuracy and performance regression detection (see the sketch after this list).
- Develop and advance quality control measures: code analysis, backwards compatibility checks, regression testing, and best practices.
- Collaborate closely with teams responsible for DL libraries and frameworks (CUDA, cuDNN, cuBLAS, PyTorch) and other engineering teams providing testing and release infrastructure.
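To illustrate the kind of automation mentioned above (accuracy and performance regression detection), here is a minimal, hypothetical Python sketch of a CI gate that compares benchmark metrics against a stored baseline. The file names, metric conventions, and the 5% tolerance are assumptions for illustration only, not part of NVIDIA's actual pipelines.

```python
"""Hypothetical regression-detection gate for a CI pipeline.

Assumes benchmark results are saved as flat JSON files of the form
{"metric_name": value, ...}. File names, metrics, and the tolerance
below are illustrative assumptions, not NVIDIA's actual setup.
"""
import json
import sys

TOLERANCE = 0.05  # allow up to a 5% drop before failing (assumed threshold)


def load_metrics(path: str) -> dict[str, float]:
    """Read a JSON file of metric name -> value."""
    with open(path) as f:
        return json.load(f)


def find_regressions(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Return descriptions of metrics that regressed versus the baseline."""
    failures = []
    for name, base_value in baseline.items():
        if name not in current:
            failures.append(f"{name}: missing from current run")
            continue
        # "Higher is better" is assumed here (e.g. tokens/sec, accuracy).
        if current[name] < base_value * (1 - TOLERANCE):
            failures.append(f"{name}: {current[name]:.4g} vs baseline {base_value:.4g}")
    return failures


if __name__ == "__main__":
    baseline = load_metrics("baseline_metrics.json")  # assumed file name
    current = load_metrics("current_metrics.json")    # assumed file name
    regressions = find_regressions(baseline, current)
    if regressions:
        print("Regression(s) detected:")
        print("\n".join(regressions))
        sys.exit(1)  # non-zero exit status fails the CI job
    print("No regressions detected.")
```

In practice a check like this would run as a step in a GitLab, GitHub Actions, or Jenkins pipeline after the benchmark stage, with the baseline file updated on each accepted release.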
Requirements
- BS or MS in Computer Science, Computer Architecture, or related technical field (or equivalent experience) and 6+ years of industry experience in DevOps and infrastructure engineering.
- Strong system-level programming skills in Python and shell scripting.
- Extensive understanding of build/release systems, CI/CD concepts, and hands-on experience with GitLab, GitHub, Jenkins, or similar solutions.
- Experience with Linux system administration.
- Proficiency with containerization and cluster management technologies such as Docker and Kubernetes.
- Experience with build tools including Make and CMake.
- Strong background in source code management (SCM) solutions such as GitLab, GitHub, Perforce.
- Strong problem-solving and debugging skills; excellent collaboration and communication abilities.
Nice to have / Ways to stand out
- Proven track record with GPU-accelerated systems at scale and high-performance computing (HPC) environments.
- Familiarity with deep learning frameworks such as PyTorch, JAX, or TensorFlow.
- Expertise in cluster and cloud compute technologies (SLURM, Lustre, Kubernetes at scale).
- Experience with software and hardware benchmarking on HPC systems.
- Experience working with CUDA, cuDNN, cuBLAS and optimizing performance of DL workloads.
Compensation & Benefits
- Base salary ranges by level (determined by location and experience):
  - Level 4: 184,000 USD - 287,500 USD per year
  - Level 5: 224,000 USD - 356,500 USD per year
- Eligible for equity and additional benefits (see NVIDIA benefits page).
Other details
- Location: Santa Clara, CA, United States (on-site as specified).
- Employment type: Full time.
- Applications accepted at least until July 29, 2025.
- NVIDIA is an equal opportunity employer and values diversity; does not discriminate based on protected characteristics.