Senior Site Reliability Engineer, DGX Cloud

at NVIDIA
πŸ“ World
πŸ“ Canada
πŸ“ United States
USD 208,000-333,500 per year
SENIOR
✅ Remote ✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 4, Ansible @ 4, Chef @ 4, Go @ 6, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, Terraform @ 4, Python @ 6, GCP @ 4, Datadog @ 4, Distributed Systems @ 7, AWS @ 4, Azure @ 4, Networking @ 4, SRE @ 6, Microservices @ 7, Reporting @ 4, Splunk @ 4, Puppet @ 4, OpenTelemetry @ 4, GPU @ 4

Details

NVIDIA is driving AI and high-performance computing forward. DGX Cloud aims to deliver a fully managed AI platform on major cloud providers, optimizing AI workloads using high-performance NVIDIA infrastructure. Work with NVIDIA's DGX Cloud team as a Senior Site Reliability Engineer to maintain high-performance DGX Cloud clusters for AI researchers and enterprise clients worldwide.

Responsibilities

  • Support large-scale Kubernetes services before launch through system design consulting, capacity management, launch reviews, and the development of software tools, platforms, and frameworks.
  • Build, implement, and support operational and reliability aspects of large-scale Kubernetes clusters with a focus on performance at scale, real-time monitoring, logging, and alerting.
  • Define SLOs/SLIs, monitor error budgets, and streamline reporting.
  • Maintain services in production by measuring and monitoring availability, latency, and overall system health.
  • Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds.
  • Scale systems sustainably through automation and drive improvements that increase reliability and velocity.
  • Lead triage and root-cause analysis of high-severity incidents; practice balanced incident response and blameless postmortems.
  • Participate in on-call rotation to support production services.

Requirements

  • BS in Computer Science or related technical field, or equivalent experience.
  • 12+ years of experience operating production services at scale.
  • Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture, with deep experience in Kubernetes operators and distributed systems at scale.
  • Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet).
  • Proficiency in at least one high-level programming language (e.g., Python or Go).
  • In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards.
  • Demonstrated ability to troubleshoot complex DNS, network, Kubernetes, and systems issues in production environments.
  • Strong knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling.
  • Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools like OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, Datadog, etc.

Preferred / Ways to Stand Out

  • Experience operating GPU-accelerated clusters with KubeVirt in production.
  • Experience applying generative-AI techniques to reduce operational toil.
  • Experience automating incident response with Shoreline or StackStorm.
  • Experience with GPU workload orchestration and large-scale GPU resource management.

Benefits

  • Competitive base salary (range below), eligibility for equity, and a generous benefits package. NVIDIA is an equal opportunity employer and committed to diversity.
  • Base salary range: 208,000 USD - 333,500 USD.
  • Applications accepted at least until September 2, 2025.