Senior Site Reliability Engineer
at NVIDIA
Santa Clara, United States
USD 224,000-356,500 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Go @ 6 Kubernetes @ 6 Linux @ 7 Ruby @ 6 IaC @ 6 Python @ 6 GCP @ 4 CI/CD @ 6 AWS @ 4 Azure @ 4 Communication @ 7 Networking @ 7 SRE @ 4 Debugging @ 7 Oracle @ 4
Details
Join our team in Santa Clara, CA, USA as a Senior Site Reliability Engineer. At NVIDIA, you'll be part of the team shaping the future of computing and guaranteeing the smooth operation of our brand-new technologies. Our mission is to leverage AI's power to build outstanding and pioneering solutions that have a significant impact on the world.
Responsibilities
- Own the solutions you build, collaborating with cross-functional teams to successfully implement them.
- Collaborate with various teams in a fast-paced environment to ensure seamless project completion.
- Continuously improve solution provisioning and management through automation.
- Identify areas to improve service resiliency using industry-standard practices.
- Detect performance issues and recommend solutions to maintain world-class service quality.
- Conduct capacity management and planning to meet ongoing operational needs.
- Participate in incident reviews, assist in root cause identification, and write RCA reports.
- Deliver SRE solutions in a globally distributed, multi-cloud hybrid environment (AWS, GCP, and on-prem).
- Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence.
- Participate in the team's on-call rotation.
Requirements
- B.S. degree in Computer Science or a related technical field (or equivalent experience) with 12+ years of experience building and supporting critical services.
- Proficiency in Kubernetes administration, modern CI/CD techniques, and Infrastructure as Code (IaC).
- Deep understanding of Linux operating systems and TCP/IP fundamentals.
- Expertise with at least one major cloud service provider (AWS, GCP, Azure).
- Demonstrated proficiency with end-to-end SRE capabilities and observability.
- Proficient in monitoring, metrics gathering, APM, container management, and log collection tools.
- 5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.
- Creative problem solver with excellent debugging skills and strong communication and documentation abilities.
Ways to stand out
- Linux certification from a well-known vendor (RedHat, Oracle, etc.).
- Prior experience managing large-scale Kubernetes deployments in production.
- Strong skills in modern container networking and storage architecture.
- Well-known Cloud Certification(s).
- Hands-on experience working with Slurm/LSF environments.
Compensation & Benefits
- Base salary range: 224,000 USD - 356,500 USD (determined based on location, experience, and pay of employees in similar positions).
- Eligible for equity and benefits (see NVIDIA benefits pages).
Additional information
- Location: Santa Clara, CA, USA.
- Role type: Full time, hybrid.
- Applications for this job will be accepted at least until August 16, 2025.
- NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.