Engineering Manager - AI DevOps

at NVIDIA
USD 224,000-425,500 per year
MIDDLE
✅ On-site

Required Skills & Competences

Security @ 3, Software Development @ 6, Ansible @ 3, Docker @ 2, Grafana @ 3, Kubernetes @ 3, Prometheus @ 3, DevOps @ 3, IaC @ 3, Terraform @ 3, Python @ 5, GCP @ 5, GitHub @ 3, GitHub Actions @ 3, CI/CD @ 3, Leadership @ 5, AWS @ 5, Azure @ 5, SRE @ 5, CloudFormation @ 5, Rust @ 5, Technical Leadership @ 6, Compliance @ 3, CUDA @ 6, GPU @ 3

Details

NVIDIA is looking for an outstanding AI DevOps Engineering Manager to lead and expand next-generation inference operations infrastructure. You will help transform AI inference delivery and support NVIDIA products such as Dynamo, Triton, NIXL, and other AI inference solutions. This role is a core part of the GitHub First initiative, enabling public CI/CD infrastructure with GPU and Kubernetes capabilities to deliver high-throughput, low-latency inferencing solutions in distributed environments. You will lead a team to ensure NVIDIA AI products achieve outstanding performance and reliability worldwide.

Responsibilities

  • Supervise a team of DevOps engineers with expertise in AI inference infrastructure, test automation (SDET), and Infrastructure as Code (IaC).
  • Architect and implement scalable test automation strategies for AI inference workloads, including performance benchmarking and automated quality gates.
  • Lead maintenance of GitHub First public CI infrastructure, focusing on single/multi-GPU testing, Kubernetes multi-node GPU testing, and cloud service provider (CSP) validation.
  • Drive Infrastructure as Code efforts using Terraform, Ansible, and Kubernetes to support scaling across multiple clouds and lead GPU cluster operations.
  • Establish operational excellence through 24x7 on-call rotations, SRE practices, automated monitoring, and self-healing systems, keeping uptime above 99.9%.
  • Lead release coordination, cost optimization, and management of multi-cloud deployments.

Requirements

  • Bachelor's or Master's degree in Computer Science or Engineering, or equivalent experience.
  • 4+ years leading DevOps/SRE organizations with direct SDET leadership experience.
  • 8+ years hands-on experience in software development, test automation, or infrastructure engineering with AI/ML or GPU-intensive workloads.
  • Proficiency with Infrastructure as Code platforms (Terraform, Ansible, or CloudFormation) and exposure to multiple cloud environments (AWS, GCP, Azure, OCI).
  • Strong technical leadership in test automation frameworks, CI/CD pipeline development, and quality engineering practices.
  • Familiarity with containerization and orchestration tools such as Docker and Kubernetes for AI/ML workloads and GPU resource management.
  • Proven success building and scaling teams in fast-paced, high-growth environments.
  • Effective interpersonal skills to collaborate with remote teams and build consensus.
  • Proficiency in Python, Rust, or related programming languages and ability to engage in architecture discussions.
  • Demonstrated operational excellence, including 24x7 on-call oversight, SRE practices, and building robust high-availability infrastructure.

Ways to stand out

  • Experience with CI/CD (specifically GitHub Actions) and releasing open-source AI software.
  • Deep AI/ML infrastructure expertise, especially with NVIDIA technologies such as CUDA, TensorRT, Dynamo, and Triton Inference Server; experience coordinating GPU cluster operations and GPU workload performance benchmarking.
  • Background in DevOps and system software testing, and previous experience leading teams working on inference engines, model serving platforms, or AI acceleration frameworks.
  • Experience with monitoring tools (Prometheus, Grafana), security scanning, static/dynamic analysis tools, and automation for license compliance in AI inferencing frameworks.

Compensation & Benefits

  • Base salary ranges by level:
    • Level 3: 224,000 USD - 356,500 USD
    • Level 4: 272,000 USD - 425,500 USD
  • You will also be eligible for equity and benefits (see NVIDIA benefits information at https://www.nvidia.com/en-us/benefits/).

Other

  • Applications accepted at least until September 29, 2025.
  • NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.