Senior Platform and EngOps Engineer - Cluster Operations
at NVIDIA
USD 144,000-270,250 per year
Required Skills & Competences
Ansible @ 4, Linux @ 6, DevOps @ 4, Python @ 4, Communication @ 4, Networking @ 6, GPU @ 4
Details
NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. Join a team of engineers who develop and maintain software facilitating GPU communication and drive solutions in High Performance Computing and Deep Learning. This role focuses on EngOps and platform engineering to manage and maintain large GPU clusters interconnected via NVLink and InfiniBand.
Responsibilities
- Develop automated tools to efficiently deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand.
- Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability, ensuring seamless operations.
- Take ownership of daily cluster failures and issues; troubleshoot promptly to maintain optimal cluster availability and performance.
- Manage the rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruptions.
- Collaborate effectively with Engineering and Product teams across multiple time zones to align cluster operations with evolving project requirements.
Requirements
- BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.
- 5+ years of hands-on experience deploying and administering clusters, servers, switches, and related infrastructure.
- Automation expertise with hands-on skills in Ansible, Python, and shell scripting.
- Deep understanding of operating systems, computer networks, and high-performance applications.
- Proficient with Linux fundamentals.
- Proven ability to work effectively with developers and test engineers across different teams and time zones.
Ways to stand out
- Familiarity with resource scheduling managers, preferably Slurm.
- Direct experience with industry-standard alerting tools and emergency response practices.
- Hands-on experience with GPU-focused hardware and software, such as DGX systems and compute clusters.
- Proficiency in crafting and implementing a robust metrics collection and alerting infrastructure.
- Proficiency in designing large-scale networking technologies and addressing associated challenges.
Compensation & Benefits
- Base salary range (depending on level and location):
  - Level 3: 144,000 USD - 230,000 USD
  - Level 4: 168,000 USD - 270,250 USD
- You will also be eligible for equity and benefits. (See: https://www.nvidia.com/en-us/benefits/)
Additional information
- Applications for this job will be accepted at least until October 14, 2025.
- NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.