Used Tools & Technologies
Not specified
Required Skills & Competences
Security @ 4, Ansible @ 4, Chef @ 4, Go @ 6, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, Terraform @ 4, Python @ 6, GCP @ 4, Datadog @ 4, Distributed Systems @ 7, AWS @ 4, Azure @ 4, Networking @ 4, SRE @ 6, Microservices @ 7, Reporting @ 4, Splunk @ 4, Puppet @ 4, OpenTelemetry @ 4, GPU @ 4
Details
NVIDIA is driving AI and high-performance computing forward. DGX Cloud aims to deliver a fully managed AI platform on major cloud providers, optimizing AI workloads using high-performance NVIDIA infrastructure. Work with NVIDIA's DGX Cloud team as a Senior Site Reliability Engineer to maintain high-performance DGX Cloud clusters for AI researchers and enterprise clients worldwide.
Responsibilities
- Support large-scale Kubernetes services before launch through system-creation consulting, development of software tools, platforms, and frameworks, capacity management, and launch reviews.
- Build, implement, and support operational and reliability aspects of large-scale Kubernetes clusters with a focus on performance at scale, real-time monitoring, logging, and alerting.
- Define SLOs/SLIs, monitor error budgets, and streamline reporting (a brief error-budget sketch follows this list).
- Maintain services in production by measuring and monitoring availability, latency, and overall system health.
- Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds.
- Scale systems sustainably through automation and drive improvements that increase reliability and velocity.
- Lead triage and root-cause analysis of high-severity incidents; practice balanced incident response and blameless postmortems.
- Participate in on-call rotation to support production services.
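To make the SLO/error-budget responsibility above concrete, here is a minimal sketch of how an availability SLO translates into an error budget and a burn rate. The 99.9% target and 30-day window are illustrative assumptions, not values taken from this posting.

```python
# Minimal error-budget sketch: illustrative numbers only, not actual SLO targets.

SLO_TARGET = 0.999             # assumed availability SLO (99.9%) over the window
WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day rolling window

def error_budget_minutes(slo_target: float, window_minutes: int) -> float:
    """Total allowed downtime (in minutes) implied by the SLO over the window."""
    return (1.0 - slo_target) * window_minutes

def burn_rate(observed_availability: float, slo_target: float) -> float:
    """How fast the budget is consumed relative to the allowed rate.
    A burn rate of 1.0 spends the budget exactly at the end of the window."""
    return (1.0 - observed_availability) / (1.0 - slo_target)

if __name__ == "__main__":
    budget = error_budget_minutes(SLO_TARGET, WINDOW_MINUTES)
    print(f"Error budget: {budget:.1f} minutes of downtime per 30 days")  # ~43.2 min

    # Example: the service has measured 99.95% availability so far this window.
    observed = 0.9995
    print(f"Burn rate at {observed:.2%} availability: {burn_rate(observed, SLO_TARGET):.2f}")
```

A burn rate above 1.0 sustained over the window would exhaust the budget early, which is the usual trigger for paging and for slowing feature launches.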
Requirements
- BS in Computer Science or related technical field, or equivalent experience.
- 12+ years of experience operating production services at scale.
- Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture, with deep experience in Kubernetes operators and distributed systems at scale.
- Experience with infrastructure automation tools such as Terraform, Ansible, Chef, or Puppet.
- Proficiency in at least one high-level programming language, such as Python or Go.
- In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards.
- Demonstrated ability to troubleshoot complex DNS, network, Kubernetes, and systems issues in production environments.
- Solid working knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling.
- Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools like OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, Datadog, etc.
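As a hedged illustration of the observability requirement above, the sketch below exposes request metrics with the prometheus_client Python library. The metric names, labels, and port are arbitrary assumptions, not a prescribed stack.

```python
# Minimal Prometheus instrumentation sketch using the prometheus_client library.
# Metric names, labels, and port are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("demo_requests_total", "Total requests handled", ["status"])
LATENCY = Histogram("demo_request_latency_seconds", "Request latency in seconds")

@LATENCY.time()
def handle_request() -> None:
    """Simulate a request and record its outcome."""
    time.sleep(random.uniform(0.01, 0.1))
    status = "ok" if random.random() > 0.05 else "error"
    REQUESTS.labels(status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        handle_request()
```

In a real stack, a Prometheus server would scrape the /metrics endpoint, Grafana would visualize it, and traces/logs would flow through tools like OpenTelemetry or an ELK pipeline as listed above.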
Preferred / Ways to Stand Out
- Experience operating GPU-accelerated clusters with KubeVirt in production.
- Experience applying generative-AI techniques to reduce operational toil.
- Experience automating incidents with Shoreline or StackStorm.
- Experience with GPU workload orchestration and large-scale GPU resource management.
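To ground the GPU-orchestration item above, here is a rough sketch that requests a single NVIDIA GPU for a pod through the official kubernetes Python client. It assumes a cluster reachable via the local kubeconfig and the NVIDIA device plugin installed; the image, names, and namespace are illustrative placeholders.

```python
# Sketch: requesting an NVIDIA GPU for a pod via the kubernetes Python client.
# Assumes a local kubeconfig and the NVIDIA device plugin in the cluster;
# image, pod name, and namespace are illustrative placeholders.
from kubernetes import client, config

def build_gpu_pod() -> client.V1Pod:
    container = client.V1Container(
        name="cuda-smoke-test",
        image="nvidia/cuda:12.4.1-base-ubuntu22.04",  # placeholder CUDA base image
        command=["nvidia-smi"],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"}  # schedule onto a node with one free GPU
        ),
    )
    spec = client.V1PodSpec(containers=[container], restart_policy="Never")
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
        spec=spec,
    )

if __name__ == "__main__":
    config.load_kube_config()  # use the local kubeconfig
    api = client.CoreV1Api()
    api.create_namespaced_pod(namespace="default", body=build_gpu_pod())
    print("Submitted gpu-smoke-test pod; inspect with: kubectl logs gpu-smoke-test")
```

Large-scale GPU resource management builds on this same primitive, layering on schedulers, quotas, and node pools across the clouds named in the responsibilities.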
Benefits
- Competitive base salary (range below), eligibility for equity, and a generous benefits package. NVIDIA is an equal opportunity employer committed to diversity.
- Base salary range: 208,000 USD - 333,500 USD.
- Applications accepted at least until September 2, 2025.