Required Skills & Competences
- Ansible (4), Docker (6), Go (4), Kubernetes (6), Linux (4), IaC (7), Terraform (4), Python (4), CI/CD (4), TensorFlow (4), Networking (4), SRE (4), Experimentation (4), PyTorch (4), GPU (4)
Details
NVIDIA is at the forefront of innovations in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU is central to applications from generative AI to autonomous vehicles. This role will architect, build, and scale high-performance ML infrastructure using Infrastructure-as-Code practices to empower scientists and engineers to train and deploy advanced ML models on powerful GPU systems.
Responsibilities
- Design, build, and maintain core ML platform infrastructure as code, primarily using Ansible and Terraform, ensuring reproducibility and scalability across large-scale, distributed GPU clusters.
- Apply SRE principles to diagnose, troubleshoot, and resolve complex system issues across the entire stack, ensuring high availability and performance for critical AI workloads.
- Develop robust internal automation and tooling for ML workflow orchestration, resource scheduling, and platform operations, following software engineering best practices.
- Collaborate with ML researchers and applied scientists to understand infrastructure needs and build solutions that streamline end-to-end experimentation.
- Evolve and operate multi-cloud and hybrid (on-prem + cloud) environments, implementing monitoring, alerting, and incident response protocols.
- Participate in the on-call rotation to support platform services and infrastructure running critical ML jobs; drive root-cause analysis and implement preventative measures.
- Write high-quality, maintainable code (Python, Go) to contribute to the core orchestration platform and automate manual processes.
- Drive adoption of modern GPU technologies and ensure smooth integration of next-generation hardware into ML pipelines (e.g., GB200, NVLink).
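To give a flavor of the resource-scheduling automation described above, here is a minimal, hypothetical Python sketch that bin-packs training jobs onto GPU nodes using first-fit placement. Node names, capacities, and job shapes are illustrative assumptions, not details from the posting; a production scheduler (e.g., Kubernetes with a device plugin) handles far more, such as preemption, topology, and fairness.

```python
# Hypothetical sketch: first-fit placement of training jobs onto GPU nodes.
# All node names and capacities below are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    total_gpus: int
    used_gpus: int = 0

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.used_gpus


def schedule(jobs: list[tuple[str, int]], nodes: list[Node]) -> dict[str, str]:
    """Assign each (job_name, gpus_needed) to the first node with capacity."""
    placements: dict[str, str] = {}
    for job, gpus in jobs:
        for node in nodes:
            if node.free_gpus >= gpus:
                node.used_gpus += gpus
                placements[job] = node.name
                break
        else:
            # No node fits; a real platform would queue or preempt here.
            placements[job] = "PENDING"
    return placements


nodes = [Node("dgx-a", 8), Node("dgx-b", 8)]
jobs = [("train-llm", 8), ("finetune", 4), ("eval", 8)]
print(schedule(jobs, nodes))
# {'train-llm': 'dgx-a', 'finetune': 'dgx-b', 'eval': 'PENDING'}
```

First-fit is deliberately the simplest policy; it illustrates why fragmentation matters on GPU clusters (the 8-GPU eval job stays pending even though 4 GPUs are free).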
Requirements
- BS/MS in Computer Science, Engineering, or equivalent experience.
- 8+ years in software/platform engineering or SRE roles, including 3+ years focused on ML infrastructure or distributed compute systems.
- Strong proficiency with Infrastructure-as-Code (IaC) tools, specifically Ansible and Terraform, with experience building and managing production infrastructure.
- SRE-oriented mindset with extensive experience diagnosing system-level issues, performance tuning, and ensuring platform reliability.
- Solid understanding of ML workflows and lifecycle—from data preprocessing to deployment.
- Proficiency operating containerized workloads with Kubernetes and Docker.
- Strong software engineering skills in Python or Go, with focus on automation, tooling, and writing production-grade code.
- Experience with Linux systems internals, networking, and performance tuning at scale.
Ways to Stand Out / Preferred
- Experience building or operating ML platforms supporting frameworks like PyTorch or TensorFlow at scale.
- Deep understanding of distributed training techniques (e.g., data/model parallelism, Horovod, NCCL).
- Expertise with modern CI/CD methodologies and GitOps practices.
- Passion for building developer-centric platforms with strong UX and operational reliability.
- Proven ability to contribute code to complex orchestration or automation platforms.
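For context on the data-parallelism concept mentioned above: each worker computes a gradient on its own shard of the batch, the gradients are averaged across workers (the all-reduce that NCCL or Horovod implements efficiently over GPUs), and every worker applies the same update. A framework-free toy sketch of that arithmetic, using an assumed 1-D linear model y = w * x:

```python
# Toy illustration of data-parallel training on y = w * x.
# Real systems use NCCL/Horovod across GPUs; this is stdlib arithmetic only.

def local_gradient(w: float, shard: list[tuple[float, float]]) -> float:
    """Mean gradient of squared-error loss (w*x - y)^2 over one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(grads: list[float]) -> float:
    """Average gradients across workers (the role of NCCL's all-reduce)."""
    return sum(grads) / len(grads)


def step(w: float, shards: list[list[tuple[float, float]]], lr: float = 0.1) -> float:
    grads = [local_gradient(w, s) for s in shards]  # each "worker" in parallel
    return w - lr * all_reduce_mean(grads)          # identical update everywhere


# Data from y = 3x, split across two workers; descent recovers w close to 3.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(50):
    w = step(w, shards)
print(round(w, 3))  # 3.0
```

Model parallelism, by contrast, splits the model itself across devices rather than the data; this sketch covers only the data-parallel case.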
Compensation & Benefits
- Base salary ranges by level:
  - Level 4: 184,000 USD - 287,500 USD
  - Level 5: 224,000 USD - 356,500 USD
- You will also be eligible for equity and benefits.
Other
- Location provided: Santa Clara, CA (United States).
- Applications for this job will be accepted at least until December 7, 2025.
- NVIDIA is an equal opportunity employer committed to a diverse work environment.