Senior Site Reliability Engineer - Storage
at NVIDIA
Santa Clara, United States
USD 168,000-322,000 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Ansible @ 4, Chef @ 4, Elasticsearch @ 4, Go @ 6, Grafana @ 4, Kubernetes @ 4, Prometheus @ 4, Kibana @ 4, Python @ 6, Hiring @ 4, AWS @ 4, Azure @ 4, Communication @ 4, Puppet @ 4, Cloud Computing @ 4
Details
NVIDIA is hiring a Senior Site Reliability Engineer focused on HPC storage to design, implement, and optimize on-prem High-Performance Computing (HPC) storage solutions while leveraging cloud computing. The role involves building distributed storage solutions, automating deployments and operations, collaborating with engineering teams, and documenting best practices for distributed file systems.
Responsibilities
- Design and implement on-prem HPC infrastructure supplemented with cloud computing to support NVIDIA's IT needs.
- Design and implement advanced storage solutions, such as high-performance NFS, S3-compatible object storage, and distributed storage systems.
- Develop tooling to automate deployment and management of large-scale infrastructure environments, automate operational monitoring and alerting, and enable self-service consumption of resources.
- Document procedures and best practices and perform technology evaluations related to distributed file systems.
- Collaborate with engineering teams to understand developers' workflows and gather infrastructure requirements.
- Influence and guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.
Requirements
- BS in Computer Science (or equivalent experience) with 8+ years of relevant experience; or MS with 5+ years; or Ph.D. with 3+ years.
- Deep experience with storage protocols such as NFS, NVMe/TCP, S3 and Lustre (LNet).
- Experience with containerization technologies like Kubernetes and integration of storage with container platforms.
- Proficiency in one or more programming languages (Python, Go) is required.
- Experience with monitoring and configuration management tools such as Chef, Ansible, Puppet, and SaltStack.
- Background with cloud infrastructure (AWS, Azure, or Google Cloud).
- Experience with monitoring stacks such as Prometheus + Grafana and Elasticsearch + Kibana.
- Excellent communication and collaboration skills.
Nice to Have / Ways to Stand Out
- Knowledge of HPC and AI solution technologies (CPUs, GPUs, high-speed interconnects, and supporting software).
- Experience with RDMA fabrics (InfiniBand or RoCE).
- Background with HPC cluster management tools such as Slurm, PBS, LSF.
- Passion and experience in AI methodologies.
Benefits
- Base salary is determined by location and experience. Base salary ranges provided in the posting: 168,000 USD - 264,500 USD for Level 4; 200,000 USD - 322,000 USD for Level 5.
- Eligible for equity and other benefits (see NVIDIA benefits).
- Role is listed as hybrid (#LI-Hybrid).
NVIDIA is an equal opportunity employer committed to fostering a diverse work environment. Applications accepted at least until August 24, 2025.