Vacancy is archived. Applications are no longer accepted.

Principal Site Reliability Engineer, AI Infrastructure

at NVIDIA

📍 Santa Clara, United States

SENIOR

✅ On-site

Required Skills & Competences

Go @ 6, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, Terraform @ 4, Python @ 6, GCP @ 4, Distributed Systems @ 4, TensorFlow @ 3, AWS @ 4, Azure @ 4, Communication @ 7, Mentoring @ 4, SRE @ 7, Kubeflow @ 3, Rust @ 6, System Architecture @ 4, PyTorch @ 3, GPU @ 4

Details

NVIDIA is a technology leader in AI, accelerated computing, and graphics. This role will architect, lead, and scale globally distributed production systems that support AI/ML, HPC, and critical engineering platforms across hybrid and multi-cloud environments. The position emphasizes automation, reliability, observability, and mentoring/leading global teams to deliver resilient, high-velocity platforms for GPU-heavy AI workloads.

Responsibilities

Architect, lead, and scale globally distributed production systems for AI/ML, HPC, and engineering platforms across hybrid and multi-cloud environments.
Design and implement automation frameworks to reduce manual work, improve resilience, and standardize system health, change safety, and release velocity processes.
Define and evolve platform-wide reliability metrics, capacity forecasting strategies, and uncertainty testing for distributed systems.
Lead cross-organizational efforts to assess operational maturity, mitigate systemic risks, and establish long-term reliability strategies with engineering, infrastructure, and product teams.
Influence NVIDIA's AI platform roadmap through co-development with internal partners and external vendors; stay current with academic and industry advances.
Publish technical insights (papers, patents, whitepapers) and drive production engineering and system design innovation.
Provide technical leadership and mentorship to global teams, contribute to recruitment and design reviews, and develop standard methodologies for incident response, observability, and system architecture.

Requirements

15+ years of experience in Site Reliability Engineering (SRE), Production Engineering, or Cloud Infrastructure, with a strong track record of leading platform-scale efforts and high-impact programs.
Deep expertise in Linux/Unix systems engineering and public/private cloud platforms (AWS, GCP, Azure, OCI).
Expert-level programming in Python and proficiency in one or more of C++, Go, or Rust.
Demonstrated experience operating Kubernetes at scale, including CPU/GPU scheduling, microservice orchestration, and container lifecycle management in production.
Hands-on expertise with observability frameworks (Prometheus, Grafana, ELK, Loki, etc.) and Infrastructure as Code tools (Terraform, CDK, Pulumi).
Proficiency in SRE concepts such as error budgets, SLOs, distributed tracing, and architectural fault tolerance.
Strong written and verbal communication skills with ability to influence cross-functional collaborators and drive technical decisions.
Proven ability to execute long-term, forward-looking platform strategies.
Degree in Computer Science or related field, or equivalent experience.

Ways to Stand Out (Preferred / Nice-to-have)

Hands-on experience building platforms for large-scale AI training, inferencing, and data movement pipelines.
Familiarity with deep learning frameworks (PyTorch, TensorFlow, JAX) and orchestration frameworks (Ray, Kubeflow).
Expertise in hardware fleet observability, predictive failure analysis, and power/resource-aware scheduling.
Experience leading operational readiness and reliability engineering in GPU-heavy environments.
Track record of improving incident management culture, root cause analysis, and postmortem processes across large teams.

Compensation & Benefits

Base salary range: 272,000 USD - 425,500 USD (final base depends on location, experience, and internal pay comparisons).
You will also be eligible for equity and a comprehensive benefits package.

Additional Information

Applications for this job will be accepted at least until August 3, 2025.
NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.