Principal Architect, Site Reliability Engineering - GeForce Now

at Nvidia

📍 Santa Clara, United States

USD 248,000-391,000 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 4, Ansible @ 6, Go @ 7, Grafana @ 6, Kubernetes @ 4, Prometheus @ 6, DevOps @ 4, Terraform @ 6, Python @ 7, GCP @ 4, CI/CD @ 7, Datadog @ 6, Distributed Systems @ 7, Leadership @ 8, AWS @ 4, Azure @ 4, Bash @ 7, Parallel Programming @ 3, SRE @ 4, Microservices @ 7, Technical Leadership @ 8

Details

NVIDIA is seeking a Principal Architect for Site Reliability Engineering (SRE) to join the GeForce Now Engineering team. The role defines architecture and strategic direction for highly available, scalable, and secure systems that power critical services and platforms. The position requires collaboration with product, platform, and infrastructure teams to establish best practices, improve reliability, and drive the evolution of the SRE function.

Responsibilities

Design and architect scalable, resilient infrastructure for cloud-native and hybrid services.
Define and implement SRE principles, SLAs, SLOs, and error budgets across teams and services.
Collaborate with cross-functional teams to ensure reliability, observability, performance, and security.
Lead architecture reviews, disaster recovery planning, incident response strategies, and postmortems.
Develop automation frameworks for deployment, monitoring, and remediation of systems.
Champion a culture of reliability, continuous improvement, and operational excellence.
Mentor SREs and DevOps engineers, sharing knowledge and standard methodologies across the organization.

Requirements

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
15+ years of experience in infrastructure, cloud, or SRE roles, including at least 5 years in an architectural or technical leadership position.
Expertise in cloud platforms (AWS, Azure, GCP) and container orchestration (Kubernetes).
Deep understanding of distributed systems, microservices architecture, and CI/CD pipelines.
Proficient with observability tools (Prometheus, Grafana, ELK/EFK, Datadog) and infrastructure as code (Terraform, Ansible).
Strong programming/scripting skills (Python, Go, Bash, etc.).
Ability to communicate ideas and code clearly through documents and presentations.

Ways to stand out from the crowd

AWS, GCP, or Azure Professional Solution Architect Certification.
Familiarity with parallel programming and distributed computing platforms.
Experience in developing large-scale and complex applications.
Cross-platform development experience.

Benefits

Base salary range: 248,000 USD - 391,000 USD (determined based on location, experience, and pay of employees in similar positions).
Eligibility for equity and company benefits (see NVIDIA benefits).
Application deadline: applications are accepted at least until August 2, 2025.

Technologies & Concepts Mentioned

Kubernetes, AWS, Azure, GCP, distributed systems, microservices, CI/CD pipelines, Prometheus, Grafana, ELK/EFK, Datadog, Terraform, Ansible, Python, Go, Bash, SLAs, SLOs, error budgets, automation, capacity management, disaster recovery, incident response.

NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.