Principal Site Reliability Engineer - Enterprise AI Platform
at NVIDIA
USD 248,000-391,000 per year
Required Skills & Competences
Security @ 4, Go @ 6, Jenkins @ 4, Kubernetes @ 4, Linux @ 4, Ruby @ 6, IaC @ 4, Python @ 6, GitHub @ 4, GitHub Actions @ 4, CI/CD @ 4, Distributed Systems @ 4, Hiring @ 4, Leadership @ 4, AWS @ 4, Azure @ 4, Communication @ 7, Mentoring @ 4, Perl @ 6, SRE @ 8, Microservices @ 4, TAG @ 4, Agile @ 4, GPU @ 4
Details
NVIDIA is hiring a deeply technical and creative Site Reliability Engineer to build, support, and maintain next-generation AI-powered enterprise products that improve engineering efficiency, data security, and product development. The role collaborates with Cloud and AI/ML teams in a dynamic, agile environment and includes leadership responsibilities across projects and mentoring junior engineers.
Responsibilities
- Translate business objectives into actionable operational and engineering plans.
- Address operational challenges, automate processes, and iterate for efficiency.
- Tackle systemic reliability issues with cross-functional teams.
- Monitor, optimize, and manage system performance and resource usage.
- Institute validated practices for reliability, remediation, and troubleshooting.
- Design, deploy, and automate production support; document essential knowledge.
- Lead cross-organizational projects from inception to completion.
- Mentor and train junior engineers; serve as a subject matter expert for core team functions.
Requirements
- 15+ years of working experience in cloud, platform, or SRE roles.
- Bachelor's or Master's Degree in Engineering, Computer Science, or a related field, or equivalent experience.
- Proficient in one or more programming languages: Python, Go, Perl, or Ruby.
- Hands-on experience handling and scaling distributed systems in public, private, hybrid cloud, or on-prem environments (24x7x365 operations).
- Experience delivering software and deploying applications in Kubernetes clusters, including GPU and CPU pod scheduling, plus the ability to understand on-prem deployments.
- Experience maintaining and managing microservices related to AI platforms (inference, training, evaluation, ingestion).
- Hands-on experience deploying, supporting, and supervising new and existing services, platforms, and application stacks.
- Experience with CI/CD systems such as Jenkins or GitHub Actions.
- Background with Infrastructure as Code (IaC) methodologies and relevant tools.
- Extensive experience working with Microsoft Windows Server and/or Linux operating systems.
- Strong communication skills, able to articulate technical issues to non-technical audiences.
Ways to stand out
- Cloud expertise in Azure and AWS.
- Passion for and experience with AI methodologies.
- Strong background in software design and development.
- Systematic problem-solving approach, strong communication skills, and a sense of ownership and drive.
Compensation & Additional Information
- Base salary range: 248,000 USD - 391,000 USD (determined by location, experience, and comparable roles).
- Eligible for equity and benefits.
- Applications accepted at least until July 29, 2025.
- Note: The job listing includes the tag "#LI-Hybrid", indicating a hybrid work arrangement.