Principal Architect, Site Reliability Engineering - GeForce NOW

at NVIDIA
USD 248,000-391,000 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 4, Ansible @ 6, Go @ 7, Grafana @ 6, Kubernetes @ 4, Prometheus @ 6, DevOps @ 4, Terraform @ 6, Python @ 7, GCP @ 4, CI/CD @ 7, Datadog @ 6, Distributed Systems @ 7, Leadership @ 8, AWS @ 4, Azure @ 4, Bash @ 7, Parallel Programming @ 3, SRE @ 4, Microservices @ 7, Technical Leadership @ 8

Details

NVIDIA is the world leader in accelerated computing—from gaming to data centers to AI and robotics. We are a team of trailblazers reinventing computing at the intersection of graphics, high-performance computing, and AI. If you’re driven to tackle sophisticated challenges, push boundaries, and build technology that powers the future, NVIDIA is the place for you.

Responsibilities

  • Design and architect scalable, resilient infrastructure for cloud-native and hybrid services.
  • Define and implement SRE principles, SLAs, SLOs, and error budgets across teams and services.
  • Collaborate with multi-functional teams to ensure reliability, observability, performance, and security.
  • Lead architecture reviews, disaster recovery planning, incident response strategies, and postmortems.
  • Develop automation frameworks for deployment, monitoring, and remediation of systems.
  • Champion a culture of reliability, continuous improvement, and operational excellence.
  • Mentor SREs and DevOps engineers, sharing knowledge and standard methodologies across the organization.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • 15+ years of experience in infrastructure, cloud, or SRE roles, including at least 5 years in an architectural or technical leadership position.
  • Expertise in cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (Kubernetes).
  • Deep understanding of distributed systems, microservices architecture, and CI/CD pipelines.
  • Proficient with observability tools (Prometheus, Grafana, ELK/EFK, Datadog) and infrastructure as code (Terraform, Ansible).
  • Strong programming/scripting skills (Python, Go, Bash, etc.).
  • Ability to communicate your ideas/code clearly through documents, presentations, etc.

Ways to Stand Out from the Crowd

  • AWS, GCP, or Azure Professional Solution Architect Certification.
  • Familiarity with parallel programming and distributed computing platforms.
  • Experience in developing large-scale and complex applications.
  • Cross-platform development experience.

Benefits

  • Eligible for equity and benefits as detailed on NVIDIA's benefits page.

Applications for this job will be accepted at least until August 2, 2025.

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer that values diversity and does not discriminate on the basis of protected characteristics.