Distinguished Site Reliability Engineer - Cloud

at NVIDIA
USD 320,000-488,750 per year
SENIOR
✅ Remote

Required Skills & Competences

Docker @ 4, Go @ 4, Kubernetes @ 4, Linux @ 4, Ruby @ 4, Python @ 4, Distributed Systems @ 7, Communication @ 7, Mathematics @ 4, Networking @ 4, OpenStack @ 4, Perl @ 4, SRE @ 4, GPU @ 4

Details

Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline that designs, builds and maintains large-scale production systems with high efficiency and availability, combining software and systems engineering practices. This role focuses on keeping internal and external GPU cloud services running with maximum reliability and uptime, while enabling developers to make changes safely through automation, performance tuning and capacity management.

Responsibilities

  • Lead the design, implementation and support of the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging and alerting
  • Engage in and improve the whole lifecycle of services — from inception and design through deployment, operation and refinement
  • Support services before they go live through system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews
  • Maintain services once live by measuring and monitoring availability, latency and overall system health
  • Scale systems sustainably through automation and by driving changes that improve reliability and velocity
  • Practice sustainable incident response and blameless postmortems
  • Participate in an on-call rotation to support production systems

Requirements

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
  • 16+ years of experience with infrastructure automation and distributed systems design; experience designing and developing tools for running large-scale private or public cloud systems in production
  • Experience in one or more of: Python, Go, Perl or Ruby
  • In-depth knowledge of Linux, networking and containers
  • Experience across coding, databases, capacity management, continuous delivery and deployment, and open-source cloud technologies

Ways to stand out

  • Interest in crafting, analyzing and fixing large-scale distributed systems
  • Systematic problem-solving approach, strong communication skills, sense of ownership and drive
  • Ability to debug and optimize code and automate routine tasks
  • Experience using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker

Benefits

  • Base salary range: 320,000 USD - 488,750 USD
  • Eligible for equity and NVIDIA benefits