Senior Site Reliability Engineer, AI Infrastructure

at NVIDIA

📍 Santa Clara, United States

USD 184,000-356,500 per year

SENIOR

✅ On-site

Required Skills & Competences

Software Development @ 7, Go @ 6, Linux @ 4, Prometheus @ 4, Ruby @ 6, Terraform @ 7, Python @ 6, GCP @ 4, CI/CD @ 4, Distributed Systems @ 4, TensorFlow @ 6, Hiring @ 4, AWS @ 4, Azure @ 4, Communication @ 7, Mentoring @ 4, Perl @ 6, SRE @ 4, Performance Optimization @ 4, PyTorch @ 6

Details

NVIDIA is seeking a Senior Site Reliability Engineer to join the AI Infrastructure Production engineering team. The role focuses on developing and maintaining large-scale systems that support critical AI infrastructure use cases across global public and private clouds. The position emphasizes reliability, observability, automation, and operational maturity while collaborating with engineering teams and mentoring peers.

Responsibilities

Develop and maintain large-scale systems supporting critical AI infrastructure use cases across global public and private clouds.
Implement SRE fundamentals: incident management, monitoring, performance optimization, and automation to reduce manual operational overhead.
Build tools and frameworks to improve observability, define actionable reliability metrics, and enable fast issue resolution.
Establish frameworks for operational maturity, lead incident response protocols, and conduct blameless postmortems to improve resilience.
Work with engineering teams to deliver solutions, mentor team members, uphold high code and infrastructure standards, and participate in hiring.

Requirements

Degree in Computer Science or a related field, or equivalent experience, with 8+ years in Software Development, SRE, or Production Engineering.
Proficiency in Python and at least one other language: C/C++, Go, Perl, or Ruby.
Expertise in systems engineering within Linux or Windows environments.
Experience with cloud platforms: AWS, OCI, Azure, GCP.
Strong understanding of SRE principles, including error budgets, SLOs, and SLAs, and of Infrastructure as Code tools (example: Terraform CDK).
Hands-on experience with observability platforms (examples: ELK, Prometheus, Loki) and CI/CD systems (example: GitLab).
Strong communication skills with the ability to convey technical concepts to diverse audiences.
Commitment to fostering diversity, curiosity, and continuous improvement.

Ways to stand out from the crowd

Experience in AI training, inferencing, and data infrastructure services.
Proficiency with deep learning frameworks such as PyTorch, TensorFlow, JAX, and Ray.
Strong background in hardware health monitoring and system reliability.
Hands-on expertise operating and scaling distributed systems with stringent SLAs, ensuring high availability and performance.
Proven experience in incident, change, and problem management processes.

Compensation and application deadline

The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. Base salary will be determined based on location, experience, and internal pay equity. Applicants are also eligible for equity and benefits (see https://www.nvidia.com/en-us/benefits/). Applications will be accepted at least until August 31, 2025.

Benefits & Equal opportunity

You will be eligible for equity and company benefits. NVIDIA is committed to a diverse work environment and is an equal opportunity employer. The company does not discriminate in hiring or promotion on the basis of any characteristic protected by law.