Senior Manager - Storage Production Engineering And SRE

at NVIDIA

πŸ“ Santa Clara, United States

$272,000-419,750 per year

SENIOR
✅ On-site

Required Skills & Competences

  • Kubernetes @ 6
  • Leadership @ 4
  • People Management @ 7
  • Team Management @ 6
  • AWS @ 4
  • Azure @ 4
  • Mentoring @ 4
  • Networking @ 6
  • SRE @ 4

Details

As a Senior Manager in Site Reliability Engineering (SRE), you will lead a team dedicated to the design, construction, and maintenance of large-scale production systems, with an emphasis on high efficiency and availability. This role spans several domains, including software and systems engineering, cloud-scale storage, data management, and services. SRE Senior Managers bring specialized expertise in areas such as systems, networking, storage, coding, database management, capacity planning, and continuous delivery and deployment, along with proficiency in open-source cloud-enabling technologies like Kubernetes, containers, and virtualization. You will oversee the implementation of reliable storage solutions, efficient data management, and the delivery of associated services to uphold the overall stability and performance of production systems.

Responsibilities

  • Leadership: Formulating and executing strategic initiatives to enhance the reliability and performance of storage systems, aligning with organizational goals.
  • Team Management: Leading and mentoring a team of Storage SRE professionals, fostering a collaborative and innovative work environment.
  • Cloud Storage Expertise: Supervising the planning, execution, and enhancement of storage solutions, encompassing file, block, and object storage, to meet the requirements of an expanding cloud infrastructure. Ensuring the efficient use of cloud-native storage services offered by platforms like AWS S3 and Azure Blob Storage.
  • System Optimization: Collaborating with multi-functional teams to optimize storage systems, implement best practices, and ensure seamless integration with other technology stacks.
  • Incident Response: Overseeing incident response and resolution for storage-related issues, minimizing downtime, and ensuring a resilient storage environment.
  • Capacity Planning: Conducting capacity planning exercises and collaborating with team members to forecast and meet storage demands efficiently.
  • Automation and Tooling: Driving automation initiatives to streamline storage operations and developing tools for monitoring, alerting, and performance analysis.
  • Continuous Improvement: Implementing continuous improvement processes to enhance storage systems' overall reliability and efficiency.

Requirements

  • Extensive experience in a senior-level role within Site Reliability Engineering, particularly in managing storage infrastructure.
  • Technical Expertise: In-depth knowledge of storage technologies, file systems, and experience with cloud-based storage solutions. Proficiency in scripting and automation tools is essential.
  • Leadership Skills: Strong leadership and people management skills, with the ability to inspire and guide a team towards achieving common objectives.
  • Problem-Solving Skills: Exceptional analytical and problem-solving skills, with the ability to address complex storage-related issues effectively.
  • Collaboration: Demonstrated ability to collaborate with multi-functional teams and communicate effectively with technical and non-technical collaborators.
  • Prior engineering experience with hands-on coding background in storage systems.
  • Master's degree in Computer Science, Information Technology, or a related field, or equivalent experience.
  • 10+ years of overall relevant experience and 5+ years of management experience.

Preferred Qualifications

  • A demonstrated SRE mindset and customer-first approach, with a focus on customer satisfaction and a passion for ensuring customer success.
  • Professional certifications in relevant technologies (e.g., AWS Certified Solutions Architect, Certified Kubernetes Administrator). Experience with container orchestration platforms and software-defined storage solutions.
  • Proven track record of implementing and managing storage solutions in a large-scale enterprise environment. Ability to thrive in collaborative environments, work well with various teams, and adapt to different working styles.

The base salary range is 272,000 USD - 419,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.