Senior ML Platform Engineer

at NVIDIA

📍 Santa Clara, United States

USD 152,000-287,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

IaC, Machine Learning

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and you know the pitfalls and tricks.

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Ansible @ 4, Docker @ 6, Go @ 4, Kubernetes @ 6, Linux @ 4, Terraform @ 4, Python @ 4, CI/CD @ 4, TensorFlow @ 4, Networking @ 4, SRE @ 4, Experimentation @ 4, PyTorch @ 4, GPU @ 4, AI @ 4, NCCL @ 7, NVLink @ 4

Details

NVIDIA is seeking a Senior ML Platform Engineer to architect, build, and scale high-performance ML infrastructure. The role focuses on Infrastructure-as-Code, SRE practices, automation, and enabling researchers and engineers to train and deploy advanced ML models on large GPU systems.

Responsibilities

Design, build, and maintain core ML platform infrastructure as code (primarily Ansible and Terraform) for large-scale, distributed GPU clusters.
Apply SRE principles to diagnose, troubleshoot, and resolve complex system issues across the stack, ensuring high availability and performance for critical AI workloads.
Develop internal automation and tooling for ML workflow orchestration, resource scheduling, and platform operations, following software engineering best practices.
Collaborate with ML researchers and applied scientists to understand infrastructure needs and streamline end-to-end experimentation.
Evolve and operate multi-cloud and hybrid (on-prem + cloud) environments; implement monitoring, alerting, and incident response protocols.
Participate in the on-call rotation to support platform services and infrastructure, drive root cause analysis, and implement preventative measures.
Write high-quality, maintainable code (Python, Go) to contribute to orchestration platforms and automate manual processes.
Drive adoption and integration of modern GPU technologies (for example, GB200 and NVLink) into ML pipelines.

Requirements

BS/MS in Computer Science, Engineering, or equivalent experience.
5+ years in software/platform engineering or SRE roles, including 3+ years focused on ML infrastructure or distributed compute systems.
Strong proficiency with Infrastructure-as-Code tools, specifically Ansible and Terraform, with production experience.
SRE-oriented mindset with experience diagnosing system-level issues, performance tuning, and ensuring platform reliability.
Solid understanding of ML workflows and lifecycle (data preprocessing through deployment).
Proficiency in operating containerized workloads with Kubernetes and Docker.
Strong software engineering skills in Python and/or Go, focusing on automation, tooling, and production-grade code.
Experience with Linux systems internals, networking, and performance tuning at scale.

Ways To Stand Out

Experience building or operating ML platforms supporting frameworks like PyTorch or TensorFlow at scale.
Deep understanding of distributed training techniques (data/model parallelism, Horovod, NCCL).
Expertise with modern CI/CD methodologies and GitOps practices.
Passion for developer-centric platforms with great UX and strong operational reliability.
Proven ability to contribute code to complex orchestration or automation platforms.

Compensation & Benefits

Base salary ranges by level:
- Level 3: USD 152,000-241,500
- Level 4: USD 184,000-287,500
Eligible for equity and company benefits (see NVIDIA benefits).

Additional Information

Full-time role. Applications are accepted until at least June 9, 2026.
This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer and is committed to an inclusive work environment.