Senior HPC Storage Engineer

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

GPU

Required Skills & Competences

CentOS @ 6, Ceph @ 4, Docker @ 4, Linux @ 6, Python @ 6, TensorFlow @ 4, Leadership @ 4, Bash @ 6, Networking @ 7, HTTP @ 4, PyTorch @ 4, CUDA @ 6, Deep Learning @ 4, AI @ 4, NCCL @ 6, HPC @ 4, Performance Analysis @ 4

Details

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today NVIDIA is tapping into the unlimited potential of AI to define the next era of computing. As a member of the HW Infrastructure Storage Strategy team, you will provide leadership in the research, design, and implementation of fast storage solutions to enable demanding high‑performance computing and computationally intensive workloads. You will identify architectural changes encompassing file, block, and object storage to meet the scaling and performance requirements of an expanding cloud infrastructure, and help shape next‑gen storage strategy across a global computing environment.

Responsibilities

  • Research and analyze existing internal distributed storage services.
  • Research, design, and implement scalable, next‑gen distributed storage services for HPC workloads, optimizing both performance and cost‑effectiveness to meet NVIDIA's infrastructure needs.
  • Develop tooling to automate management of large‑scale infrastructure environments, automate operational monitoring and alerting, and enable self‑service consumption of resources.
  • Document procedures and practices, and perform technology evaluations related to distributed file systems.
  • Collaborate across teams to understand developers' workflows and capture infrastructure requirements.
  • Influence and guide methodologies for building, testing, and deploying applications to ensure efficient performance and resource utilization.
  • Support researchers in running flows on clusters, including performance analysis and optimization of deep learning workflows.
  • Perform root cause analysis and suggest corrective action for problems at small and large scales.

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
  • 8+ years of experience designing and/or operating large‑scale storage infrastructure.
  • Experience analyzing and tuning storage performance for a variety of workloads.
  • Proficient in CentOS/RHEL and/or Ubuntu Linux distributions.
  • Proficient in Python programming and Bash scripting.
  • In‑depth understanding of container technologies like Docker and Enroot.

Ways to stand out

  • Extensive experience with parallel and distributed filesystems (Ceph, Weka.io, Vast, Lustre, GPFS) and Linux storage kernel development.
  • Proficiency with NVIDIA GPUs, CUDA programming, and NCCL; experience with performance benchmarking (e.g., MLPerf).
  • Deep familiarity with storage hardware (HDDs, SSDs, NVMe), enclosures, and specialized appliances.
  • Strong background in Software‑Defined Networking (SDN) and high‑performance networking for AI/HPC clusters.
  • Practical experience with deep learning frameworks such as PyTorch and TensorFlow.

Compensation and benefits

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits (see https://www.nvidia.com/en-us/benefits/ and http://www.nvidiabenefits.com/).

Additional information

  • Applications for this job will be accepted at least until March 26, 2026.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.