Senior Manager - Storage Production Engineering and SRE

at Nvidia

📍 Santa Clara, United States

$272,000-419,800 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Kubernetes @ 6, Leadership @ 4, People Management @ 7, Team Management @ 6, AWS @ 4, Azure @ 4, Mentoring @ 4, Networking @ 6, SRE @ 4

Details

As a Senior Manager in Site Reliability Engineering (SRE), you will lead a team dedicated to the design, construction, and maintenance of expansive production systems, emphasizing high efficiency and availability. This role spans various domains, including software and systems engineering, cloud-scale storage, data management, and services. SRE Senior Managers bring specialized expertise in areas such as systems, networking, storage, coding, database management, capacity planning, and continuous delivery and deployment, along with proficiency in open-source cloud-enabling technologies like Kubernetes, containers, and virtualization. Your role involves overseeing the implementation of reliable storage solutions, managing data efficiently, and delivering associated services to uphold the overall stability and performance of production systems.

Responsibilities

Leadership: Formulating and executing strategic initiatives to enhance the reliability and performance of storage systems, aligning with organizational goals.
Team Management: Leading and mentoring a team of Storage SRE professionals, fostering a collaborative and innovative work environment.
Cloud Storage Expertise: Supervising the planning, execution, and enhancement of storage solutions, encompassing file, block, and object storage, to meet the requirements of an expanding cloud infrastructure. Ensuring the efficient use of cloud-native storage services such as AWS S3 and Azure Blob Storage.
System Optimization: Collaborating with multi-functional teams to optimize storage systems, implement best practices, and ensure seamless integration with other technology stacks.
Incident Response: Overseeing incident response and resolution for storage-related issues, minimizing downtime, and ensuring a resilient storage environment.
Capacity Planning: Conducting capacity planning exercises and collaborating with team members to forecast and meet storage demands efficiently.
Automation and Tooling: Driving automation initiatives to streamline storage operations and developing tools for monitoring, alerting, and performance analysis.
Continuous Improvement: Implementing continuous improvement processes to enhance storage systems' overall reliability and efficiency.

Requirements

Extensive experience in a senior-level role within Site Reliability Engineering, particularly in managing storage infrastructure.
Technical Expertise: In-depth knowledge of storage technologies, file systems, and experience with cloud-based storage solutions. Proficiency in scripting and automation tools is essential.
Leadership Skills: Strong leadership and people management skills, with the ability to inspire and guide a team towards achieving common objectives.
Problem-Solving Skills: Exceptional analytical and problem-solving skills, with the ability to address complex storage-related issues effectively.
Collaboration: Demonstrated ability to collaborate with multi-functional teams and communicate effectively with technical and non-technical collaborators.
Prior engineering experience with hands-on coding background in storage systems.
Education: Master's degree in Computer Science, Information Technology, or a related field, or equivalent experience.
10+ years of overall relevant experience and 5+ years of management experience.

Benefits

NVIDIA offers a base salary range of 272,000 USD - 419,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer.