Principal Architect, Site Reliability Engineering - GeForce NOW
at NVIDIA
USD 248,000-391,000 per year
Required Skills & Competences
Security @ 4, Ansible @ 6, Go @ 7, Grafana @ 6, Kubernetes @ 4, Prometheus @ 6, DevOps @ 4, Terraform @ 6, Python @ 7, GCP @ 4, CI/CD @ 7, Datadog @ 6, Distributed Systems @ 7, Leadership @ 8, AWS @ 4, Azure @ 4, Bash @ 7, Parallel Programming @ 3, SRE @ 4, Microservices @ 7, Technical Leadership @ 8
Details
NVIDIA is the world leader in accelerated computing—from gaming to data centers to AI and robotics. We are a team of trailblazers reinventing computing at the intersection of graphics, high-performance computing, and AI. If you’re driven to tackle sophisticated challenges, push boundaries, and build technology that powers the future, NVIDIA is the place for you.
Responsibilities
- Design and architect scalable, resilient infrastructure for cloud-native and hybrid services.
- Define and implement SRE principles, SLAs, SLOs, and error budgets across teams and services.
- Collaborate with multi-functional teams to ensure reliability, observability, performance, and security.
- Lead architecture reviews, disaster recovery planning, incident response strategies, and postmortems.
- Develop automation frameworks for deployment, monitoring, and remediation of systems.
- Champion a culture of reliability, continuous improvement, and operational excellence.
- Mentor SREs and DevOps engineers, sharing knowledge and standard methodologies across the organization.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- 15+ years of experience in infrastructure, cloud, or SRE roles, including at least 5 years in an architectural or technical leadership position.
- Expertise in cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (Kubernetes).
- Deep understanding of distributed systems, microservices architecture, and CI/CD pipelines.
- Proficient with observability tools (Prometheus, Grafana, ELK/EFK, Datadog) and infrastructure as code (Terraform, Ansible).
- Strong programming/scripting skills (Python, Go, Bash, etc.).
- Ability to communicate your ideas/code clearly through documents, presentations, etc.
Ways to Stand Out from the Crowd
- AWS, GCP, or Azure Professional Solution Architect Certification.
- Familiarity with parallel programming and distributed computing platforms.
- Experience in developing large-scale and complex applications.
- Cross-platform development experience.
Benefits
- Eligible for equity and benefits as detailed on NVIDIA's benefits page.
Applications for this job will be accepted at least until August 2, 2025.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer, valuing diversity and not discriminating on various protected characteristics.