Principal Site Reliability Engineer - Enterprise AI Platform

at NVIDIA
USD 248,000-391,000 per year
SENIOR
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 4, Go @ 6, Jenkins @ 4, Kubernetes @ 4, Linux @ 4, Ruby @ 6, IaC @ 4, Python @ 6, GitHub @ 4, GitHub Actions @ 4, CI/CD @ 4, Distributed Systems @ 4, Hiring @ 4, Leadership @ 4, AWS @ 4, Azure @ 4, Communication @ 7, Mentoring @ 4, Perl @ 6, SRE @ 8, Microservices @ 4, TAG @ 4, Agile @ 4, GPU @ 4

Details

NVIDIA is hiring a deeply technical and creative Site Reliability Engineer to build, support, and maintain next-generation AI-powered enterprise products that improve engineering efficiency, data security, and product development. The role involves collaborating with Cloud and AI/ML teams in a dynamic, agile environment, along with leading cross-organizational projects and mentoring junior engineers.

Responsibilities

  • Translate business objectives into actionable operational and engineering plans.
  • Address operational challenges, automate processes, and iterate for efficiency.
  • Tackle systemic reliability issues with multi-functional teams.
  • Monitor, optimize, and manage system performance and resource usage.
  • Establish validated practices for reliability, remediation, and troubleshooting.
  • Design, deploy, and automate production support; document essential knowledge.
  • Lead cross-organizational projects from inception to completion.
  • Mentor and train junior engineers; serve as a subject matter expert for core team functions.

Requirements

  • 15+ years of experience in cloud, platform, or SRE roles.
  • Bachelor's or Master's Degree in Engineering, Computer Science, or a related field, or equivalent experience.
  • Proficient in one or more programming languages: Python, Go, Perl, or Ruby.
  • Hands-on experience handling and scaling distributed systems in public, private, hybrid cloud, or on-prem environments (24x7x365 operations).
  • Experience delivering software and deploying applications in Kubernetes clusters, including GPU and CPU pod scheduling and an understanding of on-prem deployments.
  • Experience maintaining and managing microservices related to AI platforms (inference, training, evaluation, ingestion).
  • Hands-on experience deploying, supporting, and monitoring new and existing services, platforms, and application stacks.
  • Experience with CI/CD systems such as Jenkins, GitHub Actions, etc.
  • Background with Infrastructure as Code (IaC) methodologies and relevant tools.
  • Extensive experience working with Microsoft Windows Server and/or Linux operating systems.
  • Strong communication skills, with the ability to articulate technical issues to non-technical audiences.
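The GPU pod scheduling requirement above refers to Kubernetes' extended-resource mechanism. As a minimal sketch (not part of the listing), the manifest below is a plain Python dict in the same shape you would serialize to YAML; it assumes the cluster runs the NVIDIA device plugin, which exposes the real `nvidia.com/gpu` resource name, while the pod name, image tag, and node label are illustrative placeholders.

```python
# Minimal Kubernetes pod manifest requesting one NVIDIA GPU, built as a
# plain Python dict (the structure you would serialize to YAML and apply
# with kubectl). Assumes the NVIDIA device plugin is installed so the
# scheduler knows about the "nvidia.com/gpu" extended resource.
gpu_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-inference-worker"},  # hypothetical name
    "spec": {
        "containers": [
            {
                "name": "worker",
                "image": "nvcr.io/nvidia/pytorch:24.01-py3",  # example image
                "resources": {
                    # GPUs are requested via limits; for extended resources,
                    # requests and limits must be equal, so limits alone suffice.
                    "limits": {"nvidia.com/gpu": 1},
                },
            }
        ],
        # Steer the pod onto GPU nodes; this label is an assumption
        # (GPU Feature Discovery commonly sets nvidia.com/gpu.present).
        "nodeSelector": {"nvidia.com/gpu.present": "true"},
    },
}

if __name__ == "__main__":
    limits = gpu_pod["spec"]["containers"][0]["resources"]["limits"]
    print(limits["nvidia.com/gpu"])  # -> 1
```

CPU-only pods in the same cluster simply omit the `nvidia.com/gpu` limit and the node selector, which is the distinction the listing's "GPU and CPU pod scheduling" phrase points at.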

Ways to stand out

  • Cloud expertise in Azure and AWS.
  • Passion and experience in AI methodologies.
  • Strong background in software design and development.
  • Systematic problem-solving approach, strong communication skills, and a sense of ownership and drive.

Compensation & Additional Information

  • Base salary range: 248,000 USD - 391,000 USD (determined by location, experience, and comparable roles).
  • Eligible for equity and benefits.
  • Applications accepted at least until July 29, 2025.
  • Note: The job listing includes the tag "#LI-Hybrid", indicating a hybrid work arrangement.