Senior Manager, Site Reliability Engineering - DGX Cloud
at NVIDIA
USD 248,000-396,800 per year
Required Skills & Competences
Security @ 4, Ansible @ 4, Chef @ 4, Go @ 6, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, DevOps @ 8, Terraform @ 4, Python @ 6, GCP @ 4, Distributed Systems @ 4, Leadership @ 4, AWS @ 4, Azure @ 4, Communication @ 4, Mentoring @ 7, Networking @ 4, SRE @ 4, Microservices @ 4, Product Management @ 4, Splunk @ 4, Puppet @ 4, Compliance @ 4
Details
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years with a legacy of innovation fueled by great technology and talented people. The company is now pioneering the future of AI and high-performance computing, with cloud platforms as a core part of this transformation.
Responsibilities
- Recruit, develop, and inspire a team of Site Reliability Engineers (SREs), fostering collaboration, ownership, and technical excellence.
- Provide mentorship, guidance, and career development opportunities for the team.
- Establish and enforce SRE standard practices including Service Level Objectives (SLOs), Service Level Indicators (SLIs), error budgets, and incident management processes.
- Collaborate with engineering and product teams to design, build, and deploy scalable, fault-tolerant, and high-performance cloud services.
- Lead architecture reviews to embed operational considerations early.
- Drive automation throughout the service lifecycle: provisioning, deployment, monitoring, incident response, and capacity management.
- Implement comprehensive monitoring, alerting, logging, and tracing solutions using data-driven approaches.
- Oversee incident response including rapid issue mitigation and blameless post-mortems.
- Develop strategies to improve platform scalability and performance.
- Partner with engineering, security, product management, and customer success teams for alignment and effective launches.
- Ensure implementation of security standards and compliance across cloud operations.
- Provide leadership and support for on-call rotations.
- Contribute to strategic direction by identifying emerging technologies and improving operational efficiency.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field, or equivalent experience.
- 10+ years of overall experience in Site Reliability Engineering, DevOps, or similar roles, with at least 5 years in leadership/management.
- Proven experience operating large-scale distributed systems in cloud environments such as AWS, GCP, or Azure.
- Expertise in Kubernetes administration, containerization, and microservices architectures.
- Strong understanding of SRE principles including SLOs, SLIs, error budgets, and incident management.
- Extensive experience with infrastructure automation tools like Terraform, Ansible, Chef, or Puppet.
- Proficiency in programming languages such as Python or Go.
- In-depth knowledge of Linux OS, networking fundamentals (TCP/IP), and cloud security standards.
- Experience building and operating observability platforms using Prometheus, Grafana, ELK Stack, Splunk, Jaeger, etc.
- Strong leadership and mentoring skills.
- Excellent communication and problem-solving skills, able to explain complex technical concepts to diverse stakeholders.
Benefits
NVIDIA offers highly competitive salaries and a comprehensive benefits package. Candidates will also be eligible for equity and additional benefits. For more information, visit www.nvidiabenefits.com.
NVIDIA is committed to diversity and equal opportunity.