Senior Site Reliability Engineer - Storage

at NVIDIA
USD 168,000-322,000 per year
SENIOR
✅ Hybrid

Required Skills & Competences

Ansible @ 4, Chef @ 4, ElasticSearch @ 4, Go @ 6, Grafana @ 4, Kubernetes @ 4, Prometheus @ 4, Kibana @ 4, Python @ 6, AWS @ 4, Azure @ 4, Communication @ 4, Puppet @ 4, Cloud Computing @ 4

Details

NVIDIA is leading the way in developments in Artificial Intelligence, High-Performance Computing, and Visualization. Join our team as a Senior Site Reliability Engineer focused on HPC storage to design, implement, and optimize on-prem High-Performance Computing (HPC) storage solutions while leveraging cloud computing. You will craft and deploy distributed storage solutions, build automation tooling, and ensure efficient operations of a growing IT ecosystem. You will collaborate with engineering teams, document best practices, and contribute to groundbreaking projects.

Responsibilities

  • Design and implement on-prem HPC infrastructure supplemented with cloud computing to support NVIDIA's IT needs.
  • Design and implement advanced storage solutions, such as high-performance NFS, S3-compatible object storage, and distributed storage systems.
  • Develop tooling to automate deployment and management of large-scale infrastructure environments, automate operational monitoring and alerting, and enable self-service consumption of resources.
  • Document procedures and practices; perform technology evaluations related to distributed file systems.
  • Collaborate across teams to understand developer workflows and gather infrastructure requirements.
  • Influence and guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.

Requirements

  • BS in Computer Science (or equivalent experience) with 8+ years of relevant experience, MS with 5+ years, or Ph.D. with 3+ years.
  • Deep experience with storage protocols such as NFS, NVMe/TCP, S3 and Lustre (LNet).
  • Experience with container orchestration platforms such as Kubernetes and their integration with storage solutions.
  • Proficiency in at least one programming language (Python, Go) is required.
  • Experience with configuration management tools such as Chef, Ansible, Puppet, or SaltStack.
  • Background with cloud infrastructure (AWS, Azure, or Google Cloud).
  • Experience with monitoring stacks such as Prometheus + Grafana and Elasticsearch + Kibana.
  • Excellent communication and collaboration skills.

Ways to Stand Out

  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software.
  • Experience with RDMA (InfiniBand or RoCE) fabrics.
  • Background with HPC cluster management tools such as Slurm, PBS, or LSF.
  • Passion for and experience with AI methodologies.

Compensation & Benefits

  • Base salary ranges by level: Level 4: 168,000 USD - 264,500 USD; Level 5: 200,000 USD - 322,000 USD.
  • You will also be eligible for equity and benefits (link provided in original posting).
  • Applications for this job will be accepted at least until August 24, 2025.

Additional Information

  • #LI-Hybrid (hybrid work arrangement indicated)
  • NVIDIA is an equal opportunity employer committed to diversity and inclusion.