Required Skills & Competences
- Ansible (4), Docker (6), Go (4), Kubernetes (6), Linux (4), IaC (7), Terraform (4), Python (4), CI/CD (4), TensorFlow (4), Networking (4), SRE (4), Experimentation (4), PyTorch (4), GPU (4)
Details
NVIDIA is at the forefront of innovations in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU is central to applications from generative AI to autonomous vehicles. This role will architect, build, and scale high-performance ML infrastructure using Infrastructure-as-Code practices to empower scientists and engineers to train and deploy advanced ML models on powerful GPU systems.
Responsibilities
- Design, build, and maintain core ML platform infrastructure as code, primarily using Ansible and Terraform, ensuring reproducibility and scalability across large-scale, distributed GPU clusters.
- Apply SRE principles to diagnose, troubleshoot, and resolve complex system issues across the entire stack, ensuring high availability and performance for critical AI workloads.
- Develop robust internal automation and tooling for ML workflow orchestration, resource scheduling, and platform operations, following software engineering best practices.
- Collaborate with ML researchers and applied scientists to understand infrastructure needs and build solutions that streamline end-to-end experimentation.
- Evolve and operate multi-cloud and hybrid (on-prem + cloud) environments, implementing monitoring, alerting, and incident response protocols.
- Participate in the on-call rotation to support platform services and infrastructure running critical ML jobs; drive root-cause analysis and implement preventative measures.
- Write high-quality, maintainable code (Python, Go) to contribute to the core orchestration platform and automate manual processes.
- Drive adoption of modern GPU technologies and ensure smooth integration of next-generation hardware into ML pipelines (e.g., GB200, NVLink).
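To give a flavor of the resource-scheduling automation described above, here is a minimal, hypothetical Python sketch that bin-packs training jobs onto GPU nodes using first-fit placement. Node names, capacities, and job shapes are illustrative assumptions, not details from the posting; a production scheduler (e.g., Kubernetes with a device plugin) handles far more, such as preemption, topology, and fairness.

```python
# Hypothetical sketch: first-fit placement of training jobs onto GPU nodes.
# All node names and capacities below are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    total_gpus: int
    used_gpus: int = 0

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.used_gpus


def schedule(jobs: list[tuple[str, int]], nodes: list[Node]) -> dict[str, str]:
    """Assign each (job_name, gpus_needed) to the first node with capacity."""
    placements: dict[str, str] = {}
    for job, gpus in jobs:
        for node in nodes:
            if node.free_gpus >= gpus:
                node.used_gpus += gpus
                placements[job] = node.name
                break
        else:
            # No node fits; a real platform would queue or preempt here.
            placements[job] = "PENDING"
    return placements


nodes = [Node("dgx-a", 8), Node("dgx-b", 8)]
jobs = [("train-llm", 8), ("finetune", 4), ("eval", 8)]
print(schedule(jobs, nodes))
# {'train-llm': 'dgx-a', 'finetune': 'dgx-b', 'eval': 'PENDING'}
```

First-fit is deliberately the simplest policy; it illustrates why fragmentation matters on GPU clusters (the 8-GPU eval job stays pending even though 4 GPUs are free).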
Requirements
- BS/MS in Computer Science, Engineering, or equivalent experience.
- 8+ years in software/platform engineering or SRE roles, including 3+ years focused on ML infrastructure or distributed compute systems.
- Strong proficiency with Infrastructure-as-Code (IaC) tools, specifically Ansible and Terraform, with experience building and managing production infrastructure.
- SRE-oriented mindset with extensive experience diagnosing system-level issues, performance tuning, and ensuring platform reliability.
- Solid understanding of ML workflows and lifecycle—from data preprocessing to deployment.
- Proficiency operating containerized workloads with Kubernetes and Docker.
- Strong software engineering skills in Python or Go, with focus on automation, tooling, and writing production-grade code.
- Experience with Linux systems internals, networking, and performance tuning at scale.
Ways to Stand Out / Preferred
- Experience building or operating ML platforms supporting frameworks like PyTorch or TensorFlow at scale.
- Deep understanding of distributed training techniques (e.g., data/model parallelism, Horovod, NCCL).
- Expertise with modern CI/CD methodologies and GitOps practices.
- Passion for building developer-centric platforms with strong UX and operational reliability.
- Proven ability to contribute code to complex orchestration or automation platforms.
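For context on the data-parallelism concept mentioned above: each worker computes a gradient on its own shard of the batch, the gradients are averaged across workers (the all-reduce that NCCL or Horovod implements efficiently over GPUs), and every worker applies the same update. A framework-free toy sketch of that arithmetic, using an assumed 1-D linear model y = w * x:

```python
# Toy illustration of data-parallel training on y = w * x.
# Real systems use NCCL/Horovod across GPUs; this is stdlib arithmetic only.

def local_gradient(w: float, shard: list[tuple[float, float]]) -> float:
    """Mean gradient of squared-error loss (w*x - y)^2 over one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(grads: list[float]) -> float:
    """Average gradients across workers (the role of NCCL's all-reduce)."""
    return sum(grads) / len(grads)


def step(w: float, shards: list[list[tuple[float, float]]], lr: float = 0.1) -> float:
    grads = [local_gradient(w, s) for s in shards]  # each "worker" in parallel
    return w - lr * all_reduce_mean(grads)          # identical update everywhere


# Data from y = 3x, split across two workers; descent recovers w close to 3.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(50):
    w = step(w, shards)
print(round(w, 3))  # 3.0
```

Model parallelism, by contrast, splits the model itself across devices rather than the data; this sketch covers only the data-parallel case.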
Compensation & Benefits
- Base salary ranges by level:
  - Level 4: 184,000 USD - 287,500 USD
  - Level 5: 224,000 USD - 356,500 USD
- You will also be eligible for equity and benefits.
Other
- Location provided: Santa Clara, CA (United States).
- Applications for this job will be accepted at least until December 7, 2025.
- NVIDIA is an equal opportunity employer committed to a diverse work environment.