Senior Manager, Site Reliability Engineering - DGX Cloud

at NVIDIA
USD 248,000-396,800 per year
SENIOR
✅ On-site

Required Skills & Competences

Security @ 4, Ansible @ 4, Chef @ 4, Go @ 6, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, DevOps @ 8, Terraform @ 4, Python @ 6, GCP @ 4, Distributed Systems @ 4, Leadership @ 4, AWS @ 4, Azure @ 4, Communication @ 4, Mentoring @ 7, Networking @ 4, SRE @ 4, Microservices @ 4, Product Management @ 4, Splunk @ 4, Puppet @ 4, Compliance @ 4

Details

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years with a legacy of innovation fueled by great technology and talented people. The company is now pioneering the future of AI and high-performance computing, with cloud platforms as a core part of this transformation.

Responsibilities

  • Recruit, develop, and inspire a team of Site Reliability Engineers (SREs), fostering collaboration, ownership, and technical excellence.
  • Provide mentorship, guidance, and career development opportunities for the team.
  • Establish and enforce SRE standard practices including Service Level Objectives (SLOs), Service Level Indicators (SLIs), error budgets, and incident management processes.
  • Collaborate with engineering and product teams to design, build, and deploy scalable, fault-tolerant, and high-performance cloud services.
  • Lead architecture reviews to embed operational considerations early.
  • Drive automation throughout the service lifecycle: provisioning, deployment, monitoring, incident response, and capacity management.
  • Implement comprehensive monitoring, alerting, logging, and tracing solutions using data-driven approaches.
  • Oversee incident response including rapid issue mitigation and blameless post-mortems.
  • Develop strategies to improve platform scalability and performance.
  • Partner with engineering, security, product management, and customer success teams for alignment and effective launches.
  • Ensure implementation of security standards and compliance across cloud operations.
  • Provide leadership and support for on-call rotations.
  • Contribute to strategic direction by identifying emerging technologies and improving operational efficiency.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field, or equivalent experience.
  • 10+ years of overall experience in Site Reliability Engineering, DevOps, or similar roles, including at least 5 years in leadership or management.
  • Proven experience operating large-scale distributed systems in cloud environments such as AWS, GCP, or Azure.
  • Expertise in Kubernetes administration, containerization, and microservices architectures.
  • Strong understanding of SRE principles including SLOs, SLIs, error budgets, and incident management.
  • Extensive experience with infrastructure automation tools like Terraform, Ansible, Chef, or Puppet.
  • Proficiency in programming languages such as Python or Go.
  • In-depth knowledge of Linux OS, networking fundamentals (TCP/IP), and cloud security standards.
  • Experience building and operating observability platforms using Prometheus, Grafana, ELK Stack, Splunk, Jaeger, etc.
  • Strong leadership and mentoring skills.
  • Excellent communication and problem-solving skills, including the ability to explain complex technical concepts to diverse stakeholders.

Benefits

NVIDIA offers highly competitive salaries and a comprehensive benefits package. Candidates will also be eligible for equity and additional benefits. For more information, visit www.nvidiabenefits.com.

NVIDIA is committed to diversity and equal opportunity.