Senior Site Reliability Engineer, HPC and LSF

at NVIDIA

📍 Santa Clara, United States

USD 184,000-287,500 per year

SENIOR

✅ Hybrid

Required Skills & Competences

Ansible @ 4, CentOS @ 6, Docker @ 4, Linux @ 6, Python @ 6, Machine Learning @ 4, Communication @ 4, Networking @ 4, Perl @ 6, Performance Optimization @ 4, GPU @ 4

Details

NVIDIA is the leader in AI, machine learning, and datacenter acceleration, and is expanding into datacenter networking with Ethernet switches, NICs, and DPUs. With a legacy of innovation from the invention of the GPU to deep learning, NVIDIA focuses on solving complex, impactful challenges. Join a diverse team that fosters intellectual curiosity, collaboration, and risk-taking in a blame-free environment.

Responsibilities

Manage and support workload and resource schedulers in a large-scale HPC environment.
Automate deployment, configuration management, and operational monitoring through scripting.
Develop solutions for complex computing resource management needs.
Use grid performance metrics for troubleshooting and performance optimization.
Troubleshoot complex issues from bare metal to the application level, ensuring system reliability and efficiency.
Develop, define, and document standard methodologies for internal teams.
Collaborate with domain experts to enhance infrastructure usage in chip development.
Contribute to quality improvement and time-to-market acceleration for next-generation chips.

Requirements

Extensive knowledge of job scheduler administration (e.g., IBM Spectrum LSF or Slurm).
Proficiency in administering CentOS/RHEL Linux distributions.
In-depth understanding of container technologies such as Docker.
Proficiency in UNIX scripting languages and Python.
Strong problem-solving skills to analyze complex systems and implement scalable solutions.
Excellent communication and teamwork skills.
Over 10 years of experience in large, distributed Linux environments.
BS in Computer Science or equivalent experience.

Ways to Stand Out

Experience analyzing and tuning performance for HPC or EDA workloads.
Solid understanding of cluster configuration management tools such as Ansible.
Proficiency in Perl for maintaining legacy automation scripts.
Deep understanding of distributed system principles.

Benefits

Eligibility for equity and comprehensive benefits.
NVIDIA is committed to diversity and equal opportunity employment.
