Used Tools & Technologies
IaC, Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Ansible @ 4
Docker @ 6
Go @ 4
Kubernetes @ 6
Linux @ 4
Terraform @ 4
Python @ 4
CI/CD @ 4
TensorFlow @ 4
Networking @ 4
SRE @ 4
Experimentation @ 4
PyTorch @ 4
GPU @ 4
AI @ 4
NCCL @ 7
NVLink @ 4
Details
NVIDIA is seeking a Senior ML Platform Engineer to architect, build, and scale high-performance ML infrastructure. The role focuses on Infrastructure-as-Code, SRE practices, automation, and enabling researchers and engineers to train and deploy advanced ML models on large GPU systems.
Responsibilities
- Design, build, and maintain core ML platform infrastructure as code (primarily Ansible and Terraform) for large-scale, distributed GPU clusters.
- Apply SRE principles to diagnose, troubleshoot, and resolve complex system issues across the stack, ensuring high availability and performance for critical AI workloads.
- Develop internal automation and tooling for ML workflow orchestration, resource scheduling, and platform operations with software engineering best practices.
- Collaborate with ML researchers and applied scientists to understand infrastructure needs and streamline end-to-end experimentation.
- Evolve and operate multi-cloud and hybrid (on-prem + cloud) environments; implement monitoring, alerting, and incident response protocols.
- Participate in on-call rotation to support platform services and infrastructure, drive root cause analysis, and implement preventative measures.
- Write high-quality, maintainable code (Python, Go) to contribute to orchestration platforms and automate manual processes.
- Drive adoption and integration of modern GPU technologies into ML pipelines (for example, GB200 and NVLink).
Requirements
- BS/MS in Computer Science, Engineering, or equivalent experience.
- 5+ years in software/platform engineering or SRE roles, including 3+ years focused on ML infrastructure or distributed compute systems.
- Strong proficiency with Infrastructure-as-Code tools, specifically Ansible and Terraform, with production experience.
- SRE-oriented mindset with experience diagnosing system-level issues, performance tuning, and ensuring platform reliability.
- Solid understanding of ML workflows and lifecycle (data preprocessing through deployment).
- Proficiency in operating containerized workloads with Kubernetes and Docker.
- Strong software engineering skills in Python and/or Go, focusing on automation, tooling, and production-grade code.
- Experience with Linux systems internals, networking, and performance tuning at scale.
Ways To Stand Out
- Experience building or operating ML platforms supporting frameworks like PyTorch or TensorFlow at scale.
- Deep understanding of distributed training techniques (data/model parallelism, Horovod, NCCL).
- Expertise with modern CI/CD methodologies and GitOps practices.
- Passion for developer-centric platforms with great UX and strong operational reliability.
- Proven ability to contribute code to complex orchestration or automation platforms.
Compensation & Benefits
- Base salary ranges by level:
  - Level 3: 152,000 USD to 241,500 USD
  - Level 4: 184,000 USD to 287,500 USD
- Eligible for equity and company benefits (see NVIDIA benefits).
Additional Information
- Full-time role. Applications are accepted until at least June 9, 2026.
- This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer and committed to an inclusive work environment.