Senior Site Reliability Engineer

at NVIDIA
USD 148,000-287,500 per year
SENIOR
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

Go @ 6, Kubernetes @ 6, Linux @ 7, Ruby @ 6, IaC @ 6, Python @ 6, GCP @ 4, CI/CD @ 6, AWS @ 4, Azure @ 4, Communication @ 7, Networking @ 7, SRE @ 4, Debugging @ 7, Oracle @ 4

Details

Join our team in Santa Clara, CA, USA as a Senior Site Reliability Engineer. At NVIDIA, you'll be part of the team shaping the future of computing and guaranteeing the smooth operation of our brand-new technologies. Our mission is to leverage AI's power to build outstanding and pioneering solutions that have a significant impact on the world.

Responsibilities

  • Own the solutions you build, collaborating with cross-functional teams to successfully implement them.
  • Collaborate with various teams in a fast-paced environment to ensure seamless project completion.
  • Continuously improve solution provisioning and management through automation.
  • Identify areas to improve service resiliency using industry-standard practices.
  • Detect performance issues and recommend solutions to maintain world-class service quality.
  • Conduct capacity management and planning to meet ongoing operational needs.
  • Participate in incident reviews, assist in root cause identification, and write RCA reports.
  • Deliver SRE solutions in a globally distributed, multi-cloud hybrid environment (AWS, GCP, and on-prem).
  • Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence.
  • Participate in the team's on-call rotation.

Requirements

  • B.S. degree in Computer Science or related technical field (or equivalent experience) with 5+ years in building and supporting critical services.
  • Proficiency in Kubernetes administration, modern CI/CD techniques, and Infrastructure as Code (IaC).
  • Deep understanding of Linux operating systems and TCP/IP fundamentals.
  • Expertise with at least one major cloud service provider: AWS, GCP, or Azure.
  • Demonstrated proficiency with end-to-end SRE capabilities and observability.
  • Proficient in monitoring, metrics gathering, APM, container management, and log collection tools.
  • 5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.
  • Creative problem solver with excellent debugging skills and strong communication and documentation abilities.

Ways to stand out (nice to have)

  • Linux certification from a well-known vendor (Red Hat, Oracle, etc.).
  • Prior experience managing large-scale Kubernetes deployments in production.
  • Strong skills in modern container networking and storage architecture.
  • Experience designing AI chatbots and agentic automation workflows.
  • Hands-on experience working with Slurm/LSF environments.

Compensation & Benefits

  • Base salary ranges (location- and level-dependent):
    • Level 3: 148,000 USD - 235,750 USD per year
    • Level 4: 184,000 USD - 287,500 USD per year
  • You will also be eligible for equity and benefits (see NVIDIA benefits page).

Other information

  • Location: Santa Clara, CA, USA
  • Time type: Full time
  • Team participates in an on-call rotation
  • Applications accepted at least until August 16, 2025.
  • NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.