Principal Site Reliability Engineer - Enterprise AI Platform

at NVIDIA

📍 Santa Clara, United States

USD 248,000-391,000 per year

SENIOR

✅ Hybrid

Required Skills & Competences

Security @ 4, Go @ 6, Jenkins @ 4, Kubernetes @ 4, Linux @ 4, Ruby @ 6, IaC @ 4, Python @ 6, GitHub @ 4, GitHub Actions @ 4, CI/CD @ 4, Distributed Systems @ 4, Hiring @ 4, Leadership @ 4, AWS @ 4, Azure @ 4, Communication @ 7, Mentoring @ 4, Perl @ 6, SRE @ 8, Microservices @ 4, TAG @ 4, Agile @ 4, GPU @ 4

Details

NVIDIA is hiring a deeply technical and creative Site Reliability Engineer to build, support, and maintain next-generation AI-powered enterprise products that improve engineering efficiency, data security, and product development. The role collaborates with Cloud and AI/ML teams in a dynamic, agile environment and includes leadership responsibilities across projects and mentoring junior engineers.

Responsibilities

Translate business objectives into actionable operational and engineering plans.
Address operational challenges, automate processes, and iterate for efficiency.
Tackle systemic reliability issues with multi-functional teams.
Monitor, optimize, and manage system performance and resource usage.
Institute validated practices for reliability, remediation, and troubleshooting.
Design, deploy, and automate production support; document essential knowledge.
Lead cross-organizational projects from inception to completion.
Mentor and train junior engineers; serve as a subject matter expert for core team functions.

Requirements

15+ years of experience in cloud, platform, or SRE roles.
Bachelor's or Master's Degree in Engineering, Computer Science, or a related field, or equivalent experience.
Proficient in one or more programming languages: Python, Go, Perl, or Ruby.
Hands-on experience operating and scaling distributed systems in public, private, or hybrid cloud, or on-prem environments, with 24x7x365 operations.
Experience delivering software and deploying applications in Kubernetes clusters, including GPU and CPU pod scheduling, with the ability to understand on-prem deployments.
Experience maintaining and managing microservices related to AI platforms (inference, training, evaluation, ingestion).
Hands-on experience deploying, supporting, and supervising new and existing services, platforms, and application stacks.
Experience with CI/CD systems such as Jenkins, GitHub Actions, etc.
Background with Infrastructure as Code (IaC) methodologies and relevant tools.
Extensive experience working with Microsoft Windows Server and/or Linux operating systems.
Strong communication skills, able to articulate technical issues to non-technical audiences.

Ways to stand out

Cloud expertise in Azure and AWS.
Passion and experience in AI methodologies.
Strong background in software design and development.
Systematic problem-solving approach, strong communication skills, and a sense of ownership and drive.

Compensation & Additional Information

Base salary range: 248,000 USD - 391,000 USD (determined by location, experience, and comparable roles).
Eligible for equity and benefits.
Applications accepted at least until July 29, 2025.
Note: The job listing includes the tag "#LI-Hybrid", indicating a hybrid work arrangement.