Distinguished Site Reliability Engineer - Cloud

at NVIDIA

📍 United States

USD 320,000-488,750 per year

SENIOR

✅ Remote

Used Tools & Technologies

Not specified

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;

7-9: expert. You can teach others and you know all the pitfalls and tricks;

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Docker @ 4
Go @ 4
Kubernetes @ 4
Linux @ 4
Ruby @ 4
Python @ 4
Distributed Systems @ 7
Communication @ 7
Mathematics @ 4
Networking @ 4
OpenStack @ 4
Perl @ 4
SRE @ 4
GPU @ 4

Details

Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline for designing, building and maintaining large-scale production systems with high efficiency and availability, using a combination of software and systems engineering practices. This role focuses on keeping internal and external GPU cloud services running with maximum reliability and uptime while enabling developers to make changes safely through automation, performance tuning and capacity management.

Responsibilities

Lead, design, implement and support operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging and alerting
Engage in and improve the whole lifecycle of services — from inception and design through deployment, operation and refinement
Support services before they go live through system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews
Maintain services once live by measuring and monitoring availability, latency and overall system health
Scale systems sustainably through automation and by driving changes that improve reliability and velocity
Practice sustainable incident response and blameless postmortems
Participate in an on-call rotation to support production systems

Requirements

BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
16+ years of experience with infrastructure automation and distributed systems design; experience designing and developing tools for running large-scale private or public cloud systems in production
Experience in one or more of: Python, Go, Perl or Ruby
In-depth knowledge of Linux, networking and containers
Experience across coding, databases, capacity management, continuous delivery and deployment, and open-source cloud technologies

Ways to stand out

Interest in crafting, analyzing and fixing large-scale distributed systems
Systematic problem-solving approach, strong communication skills, sense of ownership and drive
Ability to debug and optimize code and automate routine tasks
Experience using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker

Benefits

Base salary range: 320,000 USD - 488,750 USD
Eligible for equity and NVIDIA benefits