Senior Production SRE Engineer - Storage

at NVIDIA

📍 Santa Clara, United States

$148,000-339,200 per year

SENIOR
✅ On-site

Required Skills & Competences

Ansible, Chef, Go, Kubernetes, Linux, Prometheus, Ruby, Terraform, Python, Java, Algorithms, Data Structures, Machine Learning, Mathematics, Networking, Perl, SRE, Puppet

Details

Site Reliability Engineering (SRE) is an engineering discipline focused on designing, building, and maintaining large-scale production systems with high efficiency and availability. It draws on software and systems engineering practices and spans storage, data management, and services. SREs are specialists with expertise across domains such as systems, networking, storage, coding, database management, capacity management, and continuous delivery and deployment, as well as open-source cloud-enabling technologies like Kubernetes, containers, and virtualization. Their responsibilities include ensuring reliable storage solutions, managing data efficiently, and providing the related services that support the overall stability and performance of production systems.

Responsibilities

  • Assist in the design, implementation, and support of large-scale storage clusters, including monitoring, logging, and alerting.
  • Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.
  • Work closely with peers on the team to improve the lifecycle of services – from inception and design, through deployment, operation, and refinement.
  • Support services before they go live through activities such as system design consulting, developing software and frameworks, capacity management, and launch reviews.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health, including leveraging machine learning models.
  • Scale systems sustainably through mechanisms like AI/ML and automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.
  • Be part of an on-call rotation to support production systems.

Requirements

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
  • At least 5 years of practical experience.
  • Experience with algorithms, data structures, complexity analysis, software design, and maintaining large-scale Linux-based systems.
  • Experience in one or more of the following: C/C++, Java, Python, Go, Perl, Ruby, or AI/ML frameworks and methodologies.
  • Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
  • Experience using observability and tracing tools like InfluxDB, Prometheus, and the Elastic Stack.

Benefits

  • You will be eligible for equity and benefits.
  • NVIDIA accepts applications on an ongoing basis.