Senior Site Reliability Engineer, IaaS and PaaS

at NVIDIA

📍 Santa Clara, United States

USD 168,000-333,500 per year

SENIOR

✅ On-site

Required Skills & Competences

Security @ 7, Docker @ 4, Go @ 4, Kubernetes @ 4, Terraform @ 4, Python @ 4, CI/CD @ 4, Distributed Systems @ 7, AWS @ 4, Azure @ 4, OpenStack @ 4, SRE @ 4, React @ 4, IaaS @ 4, QA @ 4, OpenShift @ 4

Details

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today we are tapping into the unlimited potential of AI to define the next era of computing. NVIDIA DGX Cloud is a cloud platform tailored for AI tasks, enabling organizations to transition AI projects from development to deployment. This role is part of the DGX Cloud Engineering Team and focuses on the Omniverse on DGX Cloud platform: building deployment infrastructure processes, SRE measurement, and automation tools, and maintaining high service operability and reliability.

Responsibilities

Design, build, and implement scalable cloud-based systems for PaaS/IaaS.
Work closely with cross-functional teams on new products, features, and improvements.
Develop, maintain, and improve cloud deployment of software.
Participate in triage and resolution of complex infrastructure-related issues.
Collaborate with developers, QA, and Product teams to refine release processes and software observability to ensure operability, reliability, and availability.
Maintain services in production by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.
Develop, maintain, and improve automation tools to increase SRE operational efficiency.
Practice balanced incident response and blameless postmortems.
Participate in an on-call rotation to support production systems.

Requirements

BS or MS in Computer Science or equivalent experience.
8+ years of hands-on software engineering or equivalent experience.
Programming experience in Go and Python; experience with React.
Strong understanding of cloud design including virtualization, global infrastructure, distributed systems, and security.
Expertise in Kubernetes (K8s) and KubeVirt and experience building RESTful web services.
Familiarity with SRE principles: emitting metrics for observability, and monitoring and alerting using logs, traces, and metrics.
Hands-on experience with Docker, containers, and Infrastructure-as-Code such as Terraform; experience with CI/CD pipelines.
Experience working with cloud service providers such as AWS (for example, Fargate, EC2, IAM, ECR, EKS, and Route53) and Azure.

Ways to stand out

Expertise with StackStorm, OpenStack, Red Hat OpenShift, and AI databases like Milvus.
Demonstrated track record of solving complex problems with elegant solutions and delivering complex projects.
Experience developing frontend applications with concepts such as single-page application (SPA) patterns and RBAC.
Understanding or experience building AI agentic solutions, preferably NVIDIA open-source AI solutions.

Compensation & Other Details

Base salary ranges (location, experience and level dependent):
- Level 4: 168,000 USD - 270,250 USD
- Level 5: 208,000 USD - 333,500 USD
You will also be eligible for equity and benefits. Applications are accepted at least until October 18, 2025.

NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer.