Required Skills & Competences
- Ansible @ 4
- CentOS @ 6
- Docker @ 4
- Linux @ 6
- Python @ 6
- Machine Learning @ 4
- Communication @ 4
- Networking @ 4
- Perl @ 6
- Performance Optimization @ 4
- GPU @ 4
Details
NVIDIA is the leader in AI, machine learning, and datacenter acceleration, and is expanding into datacenter networking with Ethernet switches, NICs, and DPUs. With a legacy of innovation from GPU invention to deep learning, NVIDIA focuses on solving complex, impactful challenges. Join a diverse team that fosters intellectual curiosity, collaboration, and risk-taking in a blame-free environment.
Responsibilities
- Manage and support workload and resource schedulers in a large-scale HPC environment.
- Automate deployment, configuration management, and operational monitoring through scripting.
- Develop solutions for complex computing resource management needs.
- Use grid performance metrics for troubleshooting and performance optimization.
- Troubleshoot complex issues from bare metal to the application level, ensuring system reliability and efficiency.
- Develop, define, and document standard methodologies for internal teams.
- Collaborate with domain experts to enhance infrastructure usage in chip development.
- Contribute to quality improvement and time-to-market acceleration for next-generation chips.
Requirements
- Extensive knowledge of job scheduler administration (e.g., IBM Spectrum LSF or Slurm).
- Proficiency in administering CentOS/RHEL Linux distributions.
- In-depth understanding of container technologies such as Docker.
- Proficiency in UNIX scripting languages and Python.
- Strong problem-solving skills to analyze complex systems and implement scalable solutions.
- Excellent communication and teamwork skills.
- Over 10 years of experience in large, distributed Linux environments.
- BS in Computer Science or equivalent experience.
Ways to Stand Out
- Experience analyzing and tuning performance for HPC or EDA workloads.
- Solid understanding of cluster configuration management tools such as Ansible.
- Proficiency in Perl for maintaining legacy automation scripts.
- Deep understanding of distributed system principles.
Benefits
- Eligibility for equity and comprehensive benefits.
- Committed to diversity and equal opportunity employment.