Used Tools & Technologies
Not specified
Required Skills & Competences
Security @ 4, Ansible @ 4, Chef @ 4, Go @ 6, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, Terraform @ 4, Python @ 6, GCP @ 4, Datadog @ 4, Distributed Systems @ 7, AWS @ 4, Azure @ 4, Networking @ 4, SRE @ 6, Microservices @ 7, Reporting @ 4, Splunk @ 4, Puppet @ 4, OpenTelemetry @ 4, GPU @ 4
Details
NVIDIA is driving AI and high-performance computing forward. DGX Cloud aims to deliver a fully managed AI platform on major cloud providers, optimizing AI workloads using high-performance NVIDIA infrastructure. Work with NVIDIA's DGX Cloud team as a Senior Site Reliability Engineer to maintain high-performance DGX Cloud clusters for AI researchers and enterprise clients worldwide.
Responsibilities
- Support large-scale Kubernetes services before launch through system-creation consulting, development of software tools, platforms, and frameworks, capacity management, and launch reviews.
- Build, implement, and support operational and reliability aspects of large-scale Kubernetes clusters with a focus on performance at scale, real-time monitoring, logging, and alerting.
- Define SLOs/SLIs, monitor error budgets, and streamline reporting (a brief error-budget sketch follows this list).
- Maintain services in production by measuring and monitoring availability, latency, and overall system health.
- Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds.
- Scale systems sustainably through automation and drive improvements that increase reliability and velocity.
- Lead triage and root-cause analysis of high-severity incidents; practice balanced incident response and blameless postmortems.
- Participate in on-call rotation to support production services.
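To make the SLO/error-budget responsibility above concrete, here is a minimal sketch of how an availability SLO translates into an error budget and a burn rate. The 99.9% target and 30-day window are illustrative assumptions, not values taken from this posting.

```python
# Minimal error-budget sketch: illustrative numbers only, not actual SLO targets.

SLO_TARGET = 0.999             # assumed availability SLO (99.9%) over the window
WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day rolling window

def error_budget_minutes(slo_target: float, window_minutes: int) -> float:
    """Total allowed downtime (in minutes) implied by the SLO over the window."""
    return (1.0 - slo_target) * window_minutes

def burn_rate(observed_availability: float, slo_target: float) -> float:
    """How fast the budget is consumed relative to the allowed rate.
    A burn rate of 1.0 spends the budget exactly at the end of the window."""
    return (1.0 - observed_availability) / (1.0 - slo_target)

if __name__ == "__main__":
    budget = error_budget_minutes(SLO_TARGET, WINDOW_MINUTES)
    print(f"Error budget: {budget:.1f} minutes of downtime per 30 days")  # ~43.2 min

    # Example: the service has measured 99.95% availability so far this window.
    observed = 0.9995
    print(f"Burn rate at {observed:.2%} availability: {burn_rate(observed, SLO_TARGET):.2f}")
```

A burn rate above 1.0 sustained over the window would exhaust the budget early, which is the usual trigger for paging and for slowing feature launches.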
Requirements
- BS in Computer Science or related technical field, or equivalent experience.
- 12+ years of experience operating production services at scale.
- Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture, with deep experience in Kubernetes operators and distributed systems at scale.
- Experience with infrastructure automation tools such as Terraform, Ansible, Chef, or Puppet.
- Proficiency in at least one high-level programming language, such as Python or Go.
- In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards.
- Demonstrated ability to troubleshoot complex DNS, network, Kubernetes, and systems issues in production environments.
- Solid working knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling.
- Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools like OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, Datadog, etc.
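As a hedged illustration of the observability requirement above, the sketch below exposes request metrics with the prometheus_client Python library. The metric names, labels, and port are arbitrary assumptions, not a prescribed stack.

```python
# Minimal Prometheus instrumentation sketch using the prometheus_client library.
# Metric names, labels, and port are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("demo_requests_total", "Total requests handled", ["status"])
LATENCY = Histogram("demo_request_latency_seconds", "Request latency in seconds")

@LATENCY.time()
def handle_request() -> None:
    """Simulate a request and record its outcome."""
    time.sleep(random.uniform(0.01, 0.1))
    status = "ok" if random.random() > 0.05 else "error"
    REQUESTS.labels(status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        handle_request()
```

In a real stack, a Prometheus server would scrape the /metrics endpoint, Grafana would visualize it, and traces/logs would flow through tools like OpenTelemetry or an ELK pipeline as listed above.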
Preferred / Ways to Stand Out
- Experience operating GPU-accelerated clusters with KubeVirt in production.
- Experience applying generative-AI techniques to reduce operational toil.
- Experience automating incidents with Shoreline or StackStorm.
- Experience with GPU workload orchestration and large-scale GPU resource management.
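To ground the GPU-orchestration item above, here is a rough sketch that requests a single NVIDIA GPU for a pod through the official kubernetes Python client. It assumes a cluster reachable via the local kubeconfig and the NVIDIA device plugin installed; the image, names, and namespace are illustrative placeholders.

```python
# Sketch: requesting an NVIDIA GPU for a pod via the kubernetes Python client.
# Assumes a local kubeconfig and the NVIDIA device plugin in the cluster;
# image, pod name, and namespace are illustrative placeholders.
from kubernetes import client, config

def build_gpu_pod() -> client.V1Pod:
    container = client.V1Container(
        name="cuda-smoke-test",
        image="nvidia/cuda:12.4.1-base-ubuntu22.04",  # placeholder CUDA base image
        command=["nvidia-smi"],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"}  # schedule onto a node with one free GPU
        ),
    )
    spec = client.V1PodSpec(containers=[container], restart_policy="Never")
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
        spec=spec,
    )

if __name__ == "__main__":
    config.load_kube_config()  # use the local kubeconfig
    api = client.CoreV1Api()
    api.create_namespaced_pod(namespace="default", body=build_gpu_pod())
    print("Submitted gpu-smoke-test pod; inspect with: kubectl logs gpu-smoke-test")
```

Large-scale GPU resource management builds on this same primitive, layering on schedulers, quotas, and node pools across the clouds named in the responsibilities.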
Benefits
- Competitive base salary (range below), eligibility for equity, and a generous benefits package. NVIDIA is an equal opportunity employer committed to diversity.
- Base salary range: 208,000 USD - 333,500 USD.
- Applications accepted at least until September 2, 2025.