Senior ML Platform Engineer

at NVIDIA
USD 152,000-287,500 per year
✅ On-site

Used Tools & Technologies

IaC, Machine Learning

Required Skills & Competences

Ansible @ 4, Docker @ 6, Go @ 4, Kubernetes @ 6, Linux @ 4, Terraform @ 4, Python @ 4, CI/CD @ 4, TensorFlow @ 4, Networking @ 4, SRE @ 4, Experimentation @ 4, PyTorch @ 4, GPU @ 4, AI @ 4, NCCL @ 7, NVLink @ 4

Details

NVIDIA is seeking a Senior ML Platform Engineer to architect, build, and scale high-performance ML infrastructure. The role focuses on Infrastructure-as-Code, SRE practices, automation, and enabling researchers and engineers to train and deploy advanced ML models on large GPU systems.

Responsibilities

  • Design, build, and maintain core ML platform infrastructure as code (primarily Ansible and Terraform) for large-scale, distributed GPU clusters.
  • Apply SRE principles to diagnose, troubleshoot, and resolve complex system issues across the stack, ensuring high availability and performance for critical AI workloads.
  • Develop internal automation and tooling for ML workflow orchestration, resource scheduling, and platform operations, following software engineering best practices.
  • Collaborate with ML researchers and applied scientists to understand infrastructure needs and streamline end-to-end experimentation.
  • Evolve and operate multi-cloud and hybrid (on-prem + cloud) environments; implement monitoring, alerting, and incident response protocols.
  • Participate in the on-call rotation to support platform services and infrastructure, drive root cause analysis, and implement preventative measures.
  • Write high-quality, maintainable code (Python, Go) to contribute to orchestration platforms and automate manual processes.
  • Drive adoption and integration of modern GPU technologies, such as GB200 and NVLink, into ML pipelines.

Requirements

  • BS/MS in Computer Science, Engineering, or equivalent experience.
  • 5+ years in software/platform engineering or SRE roles, including 3+ years focused on ML infrastructure or distributed compute systems.
  • Strong proficiency with Infrastructure-as-Code tools, specifically Ansible and Terraform, with production experience.
  • SRE-oriented mindset with experience diagnosing system-level issues, performance tuning, and ensuring platform reliability.
  • Solid understanding of ML workflows and lifecycle (data preprocessing through deployment).
  • Proficiency in operating containerized workloads with Kubernetes and Docker.
  • Strong software engineering skills in Python and/or Go, focusing on automation, tooling, and production-grade code.
  • Experience with Linux systems internals, networking, and performance tuning at scale.

Ways To Stand Out

  • Experience building or operating ML platforms supporting frameworks like PyTorch or TensorFlow at scale.
  • Deep understanding of distributed training techniques and tools (data and model parallelism, Horovod, NCCL).
  • Expertise with modern CI/CD methodologies and GitOps practices.
  • Passion for developer-centric platforms with great UX and strong operational reliability.
  • Proven ability to contribute code to complex orchestration or automation platforms.

Compensation & Benefits

  • Base salary ranges by level:
    • Level 3: USD 152,000-241,500
    • Level 4: USD 184,000-287,500
  • Eligible for equity and company benefits (see NVIDIA benefits).

Additional Information

  • Full-time role. Applications are accepted at least until June 9, 2026.
  • This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer and is committed to an inclusive work environment.