Senior ML Platform Engineer - Lepton

at Nvidia
USD 184,000-356,500 per year
SENIOR
✅ On-site

Required Skills & Competences

Ansible (4), Docker (6), Go (4), Kubernetes (6), Linux (4), IaC (7), Terraform (4), Python (4), CI/CD (4), TensorFlow (4), Networking (4), SRE (4), Experimentation (4), PyTorch (4), GPU (4)

Details

NVIDIA is at the forefront of innovations in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU is central to applications from generative AI to autonomous vehicles. This role will architect, build, and scale high-performance ML infrastructure using Infrastructure-as-Code practices to empower scientists and engineers to train and deploy advanced ML models on powerful GPU systems.

Responsibilities

  • Design, build, and maintain core ML platform infrastructure as code, primarily using Ansible and Terraform, ensuring reproducibility and scalability across large-scale, distributed GPU clusters.
  • Apply SRE principles to diagnose, troubleshoot, and resolve complex system issues across the entire stack, ensuring high availability and performance for critical AI workloads.
  • Develop robust internal automation and tooling for ML workflow orchestration, resource scheduling, and platform operations, following software engineering best practices.
  • Collaborate with ML researchers and applied scientists to understand infrastructure needs and build solutions that streamline end-to-end experimentation.
  • Evolve and operate multi-cloud and hybrid (on-prem + cloud) environments, implementing monitoring, alerting, and incident response protocols.
  • Participate in an on-call rotation supporting platform services and infrastructure running critical ML jobs; drive root cause analysis and implement preventative measures.
  • Write high-quality, maintainable code (Python, Go) to contribute to the core orchestration platform and automate manual processes.
  • Drive adoption of modern GPU technologies and ensure smooth integration of next-generation hardware into ML pipelines (e.g., GB200, NVLink).

Requirements

  • BS/MS in Computer Science, Engineering, or equivalent experience.
  • 8+ years in software/platform engineering or SRE roles, including 3+ years focused on ML infrastructure or distributed compute systems.
  • Strong proficiency with Infrastructure-as-Code (IaC) tools, specifically Ansible and Terraform, with experience building and managing production infrastructure.
  • SRE-oriented mindset with extensive experience diagnosing system-level issues, performance tuning, and ensuring platform reliability.
  • Solid understanding of ML workflows and lifecycle—from data preprocessing to deployment.
  • Proficiency operating containerized workloads with Kubernetes and Docker.
  • Strong software engineering skills in Python or Go, with focus on automation, tooling, and writing production-grade code.
  • Experience with Linux systems internals, networking, and performance tuning at scale.

Ways to Stand Out / Preferred

  • Experience building or operating ML platforms supporting frameworks like PyTorch or TensorFlow at scale.
  • Deep understanding of distributed training techniques (e.g., data/model parallelism, Horovod, NCCL).
  • Expertise with modern CI/CD methodologies and GitOps practices.
  • Passion for building developer-centric platforms with strong UX and operational reliability.
  • Proven ability to contribute code to complex orchestration or automation platforms.

Compensation & Benefits

  • Base salary ranges by level:
    • Level 4: 184,000 USD - 287,500 USD
    • Level 5: 224,000 USD - 356,500 USD
  • You will also be eligible for equity and benefits.

Other

  • Location: Santa Clara, CA (United States).
  • Applications for this job will be accepted at least until December 7, 2025.
  • NVIDIA is an equal opportunity employer committed to a diverse work environment.