Principal Site Reliability Engineer - Enterprise AI Platform
at NVIDIA
USD 248,000-391,000 per year
Required Skills & Competences
Security @ 4, Go @ 6, Jenkins @ 3, Kubernetes @ 4, Linux @ 7, Ruby @ 6, IaC @ 4, Python @ 6, GitHub @ 3, GitHub Actions @ 3, CI/CD @ 3, Distributed Systems @ 4, Hiring @ 4, AWS @ 4, Azure @ 4, Communication @ 4, Perl @ 6, SRE @ 7, Agile @ 4, GPU @ 4
Details
NVIDIA is hiring a deeply technical and creative Site Reliability Engineer to build, support, and maintain next-generation AI-powered enterprise products that enhance engineering efficiency, data security, and product development. This role involves collaboration with Cloud and AI/ML teams in a dynamic, agile environment.
Responsibilities
- Collaborate on translating business objectives into actionable plans
- Address operational challenges, automate processes, and iterate for efficiency
- Tackle systemic reliability issues with cross-functional teams
- Monitor, optimize, and manage system performance and resources
- Institute validated practices for reliability, remediation, and troubleshooting
- Design, deploy, and automate production support, documenting essential knowledge
- Navigate complex tasks with deep understanding of SRE principles
- Lead cross-organizational projects from inception to completion
- Mentor and train junior engineers
- Serve as a subject matter expert in core team functions
Requirements
- 15+ years of experience in cloud, platform, or SRE roles
- Bachelor’s or Master’s degree in Engineering, Computer Science, or a related field, or equivalent experience
- Proficient in Python, Go, Perl, or Ruby
- Hands-on experience scaling distributed systems in public cloud, private cloud, hybrid cloud, or on-prem environments that run 24x7x365
- Experience deploying applications in Kubernetes clusters with GPU and CPU pod scheduling
- Experience managing microservices for AI platforms (inference, training, evaluation, ingestion)
- Experience deploying, supporting, and supervising services, platforms, and application stacks
- Familiarity with CI/CD tools such as Jenkins and GitHub Actions
- Experience with Infrastructure as Code (IaC) tools and methodologies
- Strong background working with MS Windows Server and/or Linux operating systems
- Excellent communication skills, able to explain technical issues to non-technical audiences
Ways to Stand Out
- Expertise with Azure and AWS cloud platforms
- Passion for and experience with AI methodologies
- Strong software design and development background
- Systematic problem-solving skills with strong ownership and communication
Benefits
- Competitive salary range: $248,000 - $391,000 USD per year (base salary depends on location, experience, and market benchmarks)
- Equity and additional benefits offered by NVIDIA
NVIDIA values diversity and is an equal opportunity employer committed to inclusive hiring and promotion practices.
#LI-Hybrid