Senior Site Reliability Engineer, AI Infrastructure

at NVIDIA

📍 Santa Clara, United States

USD 224,000-425,500 per year

SENIOR

✅ On-site

Required Skills & Competences

Software Development @ 8
Go @ 6
Linux @ 4
Prometheus @ 4
Ruby @ 6
Terraform @ 7
Python @ 6
GCP @ 4
CI/CD @ 4
Distributed Systems @ 4
TensorFlow @ 6
Hiring @ 4
AWS @ 4
Azure @ 4
Communication @ 4
Perl @ 6
SRE @ 4
Performance Optimization @ 4
PyTorch @ 6

Details

NVIDIA is widely considered to be one of the technology world’s most desirable employers, known for innovation in computer graphics, PC gaming, and accelerated computing for over 30 years. Now, the company is focusing on leveraging AI to define the next era of computing, with GPUs powering computers, robots, and self-driving cars.

Join the AI Infrastructure Production engineering team to make a lasting impact with cutting-edge AI infrastructure projects.

Responsibilities

Develop and maintain large-scale systems supporting critical AI Infrastructure use cases, focusing on reliability, operability, and scalability across global public and private clouds.
Implement Site Reliability Engineering (SRE) fundamentals, including incident management, monitoring, and performance optimization.
Design automation tools to reduce manual processes and operational overhead.
Build tools and frameworks to improve observability, define actionable reliability metrics, and enable fast issue resolution.
Drive continuous improvement in system performance.
Establish frameworks for operational maturity, lead incident response protocols, and conduct blameless postmortems.
Collaborate with engineering teams to deliver innovative solutions.
Mentor peers and contribute to maintaining high standards for code and infrastructure.
Participate in hiring for a diverse, high-performing team.

Requirements

Degree in Computer Science or a related field, or equivalent experience, with 12+ years in Software Development, Site Reliability Engineering, or Production Engineering.
Proficiency in Python and at least one additional programming language (C/C++, Go, Perl, Ruby).
Expertise in systems engineering within Linux or Windows environments and cloud platforms (AWS, OCI, Azure, GCP).
Strong understanding of SRE principles, including error budgets, SLOs, and SLAs, and of Infrastructure as Code tools such as Terraform CDK.
Hands-on experience with observability platforms (ELK stack, Prometheus, Loki) and CI/CD systems (e.g., GitLab).
Excellent communication skills to convey technical concepts to diverse audiences.
Commitment to fostering a culture of diversity, curiosity, and continuous improvement.

Ways to Stand Out

Experience in AI training, inferencing, and data infrastructure services.
Proficiency with deep learning frameworks including PyTorch, TensorFlow, JAX, and Ray.
Strong background in hardware health monitoring and system reliability.
Hands-on expertise in operating and scaling distributed systems with stringent SLAs to ensure high availability and performance.
Proven experience with incident, change, and problem management processes, promoting continuous improvement in complex environments.

Compensation and Benefits

Base salary range: $224,000 to $425,500 USD per year, depending on location, experience, and peer compensation.
Eligible for equity and benefits offered by NVIDIA.
NVIDIA is an equal opportunity employer committed to diversity.