Required Skills & Competences
System Administration, Ansible, Kubernetes, Linux, Python, Bash, Networking, Reporting, Puppet, GPU
Details
NVIDIA is looking for a Senior HPC Engineer to join its Infrastructure Specialists team. Academic, commercial, and government groups around the world are using NVIDIA products to revolutionize deep learning and data analytics, and to power data centers. Join the team building many of the largest and fastest AI/HPC systems in the world! NVIDIA seeks someone with excellent interpersonal skills to work on a dynamic customer-focused team. This role involves interacting with customers, partners, and internal teams to analyze, define, and implement large-scale AI/HPC projects. These efforts include networking, system design, automation, and validation.
Responsibilities
- Deploy, manage, and validate AI/HPC infrastructure in Linux-based environments for new and existing customers.
- Act as the domain expert with customers during planning calls through implementation.
- Create handover-related documentation and perform knowledge transfers to support customer rollouts.
- Provide feedback to internal teams by opening bugs, documenting workarounds, and suggesting improvements.
Requirements
- 5+ years of experience providing in-depth support and deployment services and solving hardware and software product problems.
- Knowledge of and experience with Linux system administration, including process management, package management, task scheduling, kernel management, boot procedures and troubleshooting, performance reporting, optimization, and logging, and network routing and advanced networking (tuning and monitoring).
- Experience with cluster management technologies (bonus for Base Command Manager).
- Minimum of a four-year degree in Computer Science, Electrical or Computer Engineering, or equivalent experience.
- Scripting proficiency (Bash, Python, Ansible, etc.).
- Excellent interpersonal skills; ability to resolve customer issues, prioritize, and multitask with limited supervision.
- Experience with job schedulers such as Slurm, LSF, or UGE.
- Ability to travel up to 40% within the US to customer sites.
- Background in automation tooling (Ansible, Puppet).
- Experience with Kubernetes and with benchmarking tools such as HPL, NCCL tests, and MLPerf.
Ways to Stand Out
- InfiniBand experience.
- Experience with GPU-focused hardware/software.
- Experience with MPI (Message Passing Interface).
- Experience with storage technologies such as Lustre or GPFS.
- Familiarity with Dell and Supermicro GPU platforms.
Benefits
The base salary range is 148,000 USD to 276,000 USD, determined by location, experience, and the pay of employees in comparable positions. The role is also eligible for equity and additional benefits.
NVIDIA is an equal opportunity employer committed to diversity and inclusion.