Senior High Performance Computing Cluster Administrator

at Nvidia

📍 Santa Clara, United States

$148,000-230,000 per year

SENIOR
✅ On-site

SCRAPED

Used Tools & Technologies

Not specified

Required Skills & Competences ?

Ansible @ 4 Docker @ 4 Linux @ 4 DevOps @ 4 Python @ 4 Leadership @ 4 Bash @ 4 Networking @ 4 Perl @ 4

Details

NVIDIA's Deep Learning Optimized Frameworks Group is looking for a deeply technical HPC cluster administrator to lead a diverse cluster of GPU-accelerated systems and provide architectural mentorship to product teams in the deep learning and scientific computing domains. As a member of the DLFW Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute cluster that runs demanding deep learning, high performance computing, and computationally intensive workloads. We are looking for an expert to identify architectural changes and/or completely innovative approaches for our GPU Compute Cluster. In this role, you will help us with the strategic challenges we encounter, including compute, networking, and storage design for large-scale, high-performance workloads and effective resource utilization in a heterogeneous compute environment.

Responsibilities

  • Administer Linux systems, ranging from powerful DGX servers to embedded systems, bringup hardware to publicly available systems.
  • Coordinate Storage Solutions and plan for growth.
  • Automate configuration management, software updates, and maintenance and monitoring of system availability using modern DevOps tools (Ansible, Gitlab, etc.)
  • Actively connect with management regarding any problems with the equipment and propose resolution.
  • Plan, build and install/upgrade new systems that support NVIDIA DL Software.

Requirements

  • You have a BA, BS, or MS in CS, EE, CE or equivalent experience.
  • 4+ years of previous experience deploying and administrating HPC clusters.
  • Familiar with resource scheduling managers (Slurm (preferred), LSF, etc.)
  • Proven track record to script in bash, Perl or python.
  • Experience with containers (Docker, Singularity, LXC).
  • Deep understanding of operating systems, computer networks, and high-performance applications.
  • Ability to work well with developers & test engineers.
  • Hard-working dedication to provide quality in support for your users.

Benefits

  • The base salary range is 148,000 USD - 230,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
  • You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.