Principal Site Reliability Engineer - Enterprise AI Platform

at NVIDIA
USD 248,000-391,000 per year
SENIOR
Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 4, Go @ 6, Jenkins @ 3, Kubernetes @ 4, Linux @ 7, Ruby @ 6, IaC @ 4, Python @ 6, GitHub @ 3, GitHub Actions @ 3, CI/CD @ 3, Distributed Systems @ 4, Hiring @ 4, AWS @ 4, Azure @ 4, Communication @ 4, Perl @ 6, SRE @ 7, Agile @ 4, GPU @ 4

Details

NVIDIA is hiring a deeply technical and creative Site Reliability Engineer to build, support, and maintain next-generation AI-powered enterprise products that enhance engineering efficiency, data security, and product development. This role involves collaboration with Cloud and AI/ML teams in a dynamic, agile environment.

Responsibilities

  • Collaborate on translating business objectives into actionable plans
  • Address operational challenges, automate processes, and iterate for efficiency
  • Tackle systemic reliability issues with cross-functional teams
  • Monitor, optimize, and manage system performance and resources
  • Institute validated practices for reliability, remediation, and troubleshooting
  • Design, deploy, and automate production support, documenting essential knowledge
  • Navigate complex tasks with deep understanding of SRE principles
  • Lead cross-organizational projects from inception to completion
  • Mentor and train junior engineers
  • Serve as a subject matter expert in core team functions

Requirements

  • 15+ years of experience in cloud, platform, or SRE roles
  • Bachelor’s or Master’s degree in Engineering, Computer Science, or a related field, or equivalent experience
  • Proficient in Python, Go, Perl, or Ruby
  • Hands-on experience scaling distributed systems in public, private, hybrid-cloud, or on-prem environments that run 24x7x365
  • Experience deploying applications in Kubernetes clusters with GPU and CPU pod scheduling
  • Experience managing microservices for AI platforms (inference, training, evaluation, ingestion)
  • Experience deploying, supporting, and supervising services, platforms, and application stacks
  • Familiarity with CI/CD tools such as Jenkins and GitHub Actions
  • Experience with Infrastructure as Code (IaC) tools and methodologies
  • Strong background working with MS Windows Server and/or Linux operating systems
  • Excellent communication skills, able to explain technical issues to non-technical audiences

Ways to Stand Out

  • Expertise with Azure and AWS cloud platforms
  • Passion and experience in AI methodologies
  • Strong software design and development background
  • Systematic problem-solving skills with strong ownership and communication

Benefits

  • Competitive salary range: $248,000 - $391,000 USD per year (base salary depends on location, experience, and market benchmarks)
  • Equity and additional benefits offered by NVIDIA

NVIDIA values diversity and is an equal opportunity employer committed to inclusive hiring and promotion practices.