Senior Platform and EngOps Engineer - Cluster Operations

at NVIDIA
USD 176,000-333,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Ansible @ 4, Linux @ 6, DevOps @ 4, Python @ 4, Hiring @ 4, Communication @ 4, Networking @ 4, GPU @ 4, Deep Learning @ 4, AI @ 4, InfiniBand @ 4, Slurm @ 3

Details

NVIDIA is hiring EngOps and Platform Engineers to develop and maintain software that facilitates GPU communication and to manage large GPU clusters interconnected via NVLink and InfiniBand. The role focuses on improving execution efficiency, automating cluster operations, and ensuring high availability and performance for high-performance computing and deep learning workloads.

Responsibilities

  • Develop automated tools to deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand.
  • Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability.
  • Own and troubleshoot daily cluster failures and issues to maintain optimal cluster availability and performance.
  • Manage rollout and rollback of cluster software and firmware updates, ensuring minimal disruption.
  • Collaborate with Engineering and Product teams across multiple time zones to align cluster operations with project requirements.

Requirements

  • BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.
  • 8+ years of hands-on experience deploying and administering clusters, servers, switches, and related infrastructure.
  • Automation expertise with hands-on skills in Ansible, Python, and shell scripting.
  • Deep understanding of operating systems, computer networks, and high-performance applications.
  • Proven ability to work effectively with developers and test engineers across teams and time zones.
  • Proficiency in Linux fundamentals.

Ways to stand out

  • Familiarity with resource scheduling managers (preferably Slurm).
  • Experience with industry-standard alerting tools and emergency response practices.
  • Hands-on experience with GPU-focused hardware and software (e.g., NVIDIA DGX systems and compute clusters).
  • Experience designing metrics collection and alerting infrastructure.
  • Experience designing large-scale networking technologies and addressing associated challenges.

Compensation & Benefits

  • Base salary ranges (determined by location, experience, and internal pay benchmarks):
    • Level 4: 176,000 USD - 276,000 USD
    • Level 5: 208,000 USD - 333,500 USD
  • Eligible for equity and other benefits (see company benefits page).

Additional information

  • Applications will be accepted until at least March 20, 2026.
  • This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer committed to diversity and inclusion.