Senior Platform and EngOps Engineer - Cluster Operations

at NVIDIA

📍 Santa Clara, United States

USD 144,000-270,250 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Ansible @ 4, Linux @ 6, DevOps @ 4, Python @ 4, Communication @ 4, Networking @ 6, GPU @ 4

Details

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. Join a team of engineers who develop and maintain software facilitating GPU communication and drive solutions in High-Performance Computing and Deep Learning. This role focuses on EngOps and platform engineering to manage and maintain large GPU clusters interconnected via NVLink and InfiniBand.

Responsibilities

Develop automated tools to efficiently deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand.
Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability, ensuring seamless operations.
Take ownership of daily cluster failures and issues; troubleshoot promptly to maintain optimal cluster availability and performance.
Manage the rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruptions.
Collaborate effectively with Engineering and Product teams across multiple time zones to align cluster operations with evolving project requirements.

Requirements

BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.
5+ years of hands-on experience deploying and administering clusters, servers, switches, and related infrastructure.
Automation expertise with hands-on skills in Ansible, Python, and shell scripting.
Deep understanding of operating systems, computer networks, and high-performance applications.
Proficient with Linux fundamentals.
Proven ability to work effectively with developers and test engineers across different teams and time zones.

Ways to stand out

Familiarity with resource scheduling managers, preferably Slurm.
Direct experience with industry-standard alerting tools and emergency response practices.
Hands-on experience with GPU-focused hardware and software, such as DGX systems and compute clusters.
Proficiency in crafting and implementing a robust metrics collection and alerting infrastructure.
Proficiency in designing large-scale networking technologies and addressing associated challenges.

Compensation & Benefits

Base salary range (depending on level and location):
- Level 3: 144,000 USD - 230,000 USD
- Level 4: 168,000 USD - 270,250 USD
You will also be eligible for equity and benefits. (See: https://www.nvidia.com/en-us/benefits/)

Additional information

Applications for this job will be accepted at least until October 14, 2025.
NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.