Senior Manager, Site Reliability Engineering - DGX Cloud

at NVIDIA

📍 Santa Clara, United States

USD 248,000-396,800 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 4, Ansible @ 4, Chef @ 4, Go @ 6, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, DevOps @ 8, Terraform @ 4, Python @ 6, GCP @ 4, Distributed Systems @ 4, Leadership @ 4, AWS @ 4, Azure @ 4, Communication @ 4, Mentoring @ 7, Networking @ 4, SRE @ 4, Microservices @ 4, Product Management @ 4, Splunk @ 4, Puppet @ 4, Compliance @ 4

Details

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years with a legacy of innovation fueled by great technology and talented people. The company is now pioneering the future of AI and high-performance computing, with cloud platforms as a core part of this transformation.

Responsibilities

Recruit, develop, and inspire a team of Site Reliability Engineers (SREs), fostering collaboration, ownership, and technical excellence.
Provide mentorship, guidance, and career development opportunities for the team.
Establish and enforce SRE standard practices including Service Level Objectives (SLOs), Service Level Indicators (SLIs), error budgets, and incident management processes.
Collaborate with engineering and product teams to design, build, and deploy scalable, fault-tolerant, and high-performance cloud services.
Lead architecture reviews to embed operational considerations early.
Drive automation throughout the service lifecycle: provisioning, deployment, monitoring, incident response, and capacity management.
Implement comprehensive monitoring, alerting, logging, and tracing solutions using data-driven approaches.
Oversee incident response including rapid issue mitigation and blameless post-mortems.
Develop strategies to improve platform scalability and performance.
Partner with engineering, security, product management, and customer success teams for alignment and effective launches.
Ensure implementation of security standards and compliance across cloud operations.
Provide leadership and support for on-call rotations.
Contribute to strategic direction by identifying emerging technologies and improving operational efficiency.

Requirements

Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field, or equivalent experience.
10+ years of overall experience in Site Reliability Engineering, DevOps, or similar roles, including at least 5 years in leadership/management.
Proven experience operating large-scale distributed systems in cloud environments such as AWS, GCP, or Azure.
Expertise in Kubernetes administration, containerization, and microservices architectures.
Strong understanding of SRE principles including SLOs, SLIs, error budgets, and incident management.
Extensive experience with infrastructure automation tools like Terraform, Ansible, Chef, or Puppet.
Proficiency in programming languages such as Python or Go.
In-depth knowledge of Linux OS, networking fundamentals (TCP/IP), and cloud security standards.
Experience building and operating observability platforms using Prometheus, Grafana, ELK Stack, Splunk, Jaeger, etc.
Strong leadership and mentoring skills.
Excellent communication and problem-solving skills, able to explain complex technical concepts to diverse stakeholders.

Benefits

NVIDIA offers highly competitive salaries and a comprehensive benefits package. Candidates will also be eligible for equity and additional benefits. For more information, visit www.nvidiabenefits.com.

NVIDIA is committed to diversity and equal opportunity.