Principal Site Reliability Engineer - Enterprise AI Platform

at NVIDIA

📍 Santa Clara, United States

USD 248,000-391,000 per year

SENIOR

✅ Hybrid

Required Skills & Competences

Security @ 4
Go @ 6
Jenkins @ 3
Kubernetes @ 4
Linux @ 7
Ruby @ 6
IaC @ 4
Python @ 6
GitHub @ 3
GitHub Actions @ 3
CI/CD @ 3
Distributed Systems @ 4
Hiring @ 4
AWS @ 4
Azure @ 4
Communication @ 4
Perl @ 6
SRE @ 7
Agile @ 4
GPU @ 4

Details

NVIDIA is hiring a deeply technical and creative Site Reliability Engineer to build, support, and maintain next-generation AI-powered enterprise products that enhance engineering efficiency, data security, and product development. This role involves collaboration with Cloud and AI/ML teams in a dynamic, agile environment.

Responsibilities

Collaborate on translating business objectives into actionable plans
Address operational challenges, automate processes, and iterate for efficiency
Tackle systemic reliability issues with cross-functional teams
Monitor, optimize, and manage system performance and resources
Institute validated practices for reliability, remediation, and troubleshooting
Design, deploy, and automate production support, documenting essential knowledge
Navigate complex tasks with deep understanding of SRE principles
Lead cross-organizational projects from inception to completion
Mentor and train junior engineers
Serve as a subject matter expert in core team functions

Requirements

15+ years of experience in cloud, platform, or SRE roles
Bachelor’s or Master’s degree in Engineering, Computer Science, or a related field, or equivalent experience
Proficient in Python, Go, Perl, or Ruby
Hands-on experience scaling distributed systems in public, private, hybrid cloud, or on-prem environments that run 24x7x365
Experience deploying applications in Kubernetes clusters with GPU and CPU pod scheduling
Experience managing microservices for AI platforms (inference, training, evaluation, ingestion)
Experience deploying, supporting, and supervising services, platforms, and application stacks
Familiarity with CI/CD tools such as Jenkins and GitHub Actions
Experience with Infrastructure as Code (IaC) tools and methodologies
Strong background working with MS Windows Server and/or Linux operating systems
Excellent communication skills, able to explain technical issues to non-technical audiences

Ways to Stand Out

Expertise with Azure and AWS cloud platforms
Passion and experience in AI methodologies
Strong software design and development background
Systematic problem-solving skills with strong ownership and communication

Benefits

Competitive salary range: $248,000 - $391,000 USD per year (base salary depends on location, experience, and market benchmarks)
Equity and additional benefits offered by NVIDIA

NVIDIA values diversity and is an equal opportunity employer committed to inclusive hiring and promotion practices.