Senior Site Reliability Engineer - Storage

at Nvidia

📍 Santa Clara, United States

USD 168,000-322,000 per year

SENIOR

✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

Ansible @ 4, Chef @ 4, ElasticSearch @ 4, Go @ 6, Grafana @ 4, Kubernetes @ 4, Prometheus @ 4, Kibana @ 4, Python @ 6, AWS @ 4, Azure @ 4, Communication @ 4, Puppet @ 4, Cloud Computing @ 4

Details

NVIDIA leads development in Artificial Intelligence, High-Performance Computing, and Visualization. Join our team as a Senior Site Reliability Engineer focused on HPC storage to design, implement, and optimize on-prem High-Performance Computing (HPC) storage solutions while leveraging cloud computing. You will craft and deploy distributed storage solutions, build automation tooling, and ensure the efficient operation of a growing IT ecosystem. You will collaborate with engineering teams, document best practices, and contribute to groundbreaking projects.

Responsibilities

Design and implement on-prem HPC infrastructure supplemented with cloud computing to support NVIDIA's IT needs.
Design and implement advanced storage solutions, such as high-performance NFS, S3-compatible object storage, and distributed storage systems.
Develop tooling to automate deployment and management of large-scale infrastructure environments, automate operational monitoring and alerting, and enable self-service consumption of resources.
Document procedures and practices; perform technology evaluations related to distributed file systems.
Collaborate across teams to understand developer workflows and gather infrastructure requirements.
Influence and guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.

Requirements

BS in Computer Science (or equivalent experience) with 8+ years of relevant experience; MS with 5+ years or Ph.D. with 3+ years of experience.
Deep experience with storage protocols such as NFS, NVMe/TCP, S3 and Lustre (LNet).
Experience with container orchestration platforms such as Kubernetes and their integration with storage solutions.
Proficiency in at least one programming language (Python, Go) is required.
Experience with configuration management tools such as Chef, Ansible, Puppet, or SaltStack.
Background with cloud infrastructure (AWS, Azure, or Google Cloud).
Experience with monitoring stacks such as Prometheus + Grafana and Elasticsearch + Kibana.
Excellent communication and collaboration skills.

Ways to Stand Out

Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software.
Experience with RDMA (InfiniBand or RoCE) fabrics.
Background with HPC cluster management tools such as Slurm, PBS, or LSF.
Passion for and experience with AI methodologies.

Compensation & Benefits

Base salary ranges by level: Level 4: 168,000 USD - 264,500 USD; Level 5: 200,000 USD - 322,000 USD.
You will also be eligible for equity and benefits (link provided in original posting).
Applications for this job will be accepted at least until August 24, 2025.

Additional Information

#LI-Hybrid (hybrid work arrangement indicated)
NVIDIA is an equal opportunity employer committed to diversity and inclusion.