Senior Site Reliability Engineer, HPC and LSF

at NVIDIA
USD 184,000-287,500 per year
SENIOR
✅ Hybrid

Required Skills & Competences

Ansible @ 4, CentOS @ 6, Docker @ 4, Linux @ 6, Python @ 6, Machine Learning @ 4, Communication @ 4, Networking @ 4, Perl @ 6, Performance Optimization @ 4, GPU @ 4

Details

NVIDIA is the leader in AI, machine learning, and datacenter acceleration, and is expanding into datacenter networking with Ethernet switches, NICs, and DPUs. With a legacy of innovation from the invention of the GPU to deep learning, NVIDIA focuses on solving complex, impactful challenges. Join a diverse team that fosters intellectual curiosity, collaboration, and risk-taking in a blame-free environment.

Responsibilities

  • Manage and support workload and resource schedulers in a large-scale HPC environment.
  • Automate deployment, configuration management, and operational monitoring through scripting.
  • Develop solutions for complex computing resource management needs.
  • Use grid performance metrics for troubleshooting and performance optimization.
  • Troubleshoot complex issues from bare metal to the application level, ensuring system reliability and efficiency.
  • Develop, define, and document standard methodologies for internal teams.
  • Collaborate with domain experts to enhance infrastructure usage in chip development.
  • Contribute to quality improvement and time-to-market acceleration for next-generation chips.

Requirements

  • Extensive knowledge of job scheduler administration (e.g., IBM Spectrum LSF or SLURM).
  • Proficiency in administering CentOS/RHEL Linux distributions.
  • In-depth understanding of container technologies such as Docker.
  • Proficiency in UNIX scripting languages and Python.
  • Strong problem-solving skills to analyze complex systems and implement scalable solutions.
  • Excellent communication and teamwork skills.
  • Over 10 years of experience in large, distributed Linux environments.
  • BS in Computer Science or equivalent experience.

Ways to Stand Out

  • Experience analyzing and tuning performance for HPC or EDA workloads.
  • Solid understanding of cluster configuration management tools such as Ansible.
  • Proficiency in Perl for maintaining legacy automation scripts.
  • Deep understanding of distributed system principles.

Benefits

  • Eligibility for equity and comprehensive benefits.
  • NVIDIA is committed to diversity and is an equal opportunity employer.
