Senior Platform and EngOps Engineer - Cluster Operations

at NVIDIA
USD 144,000-270,250 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

  • Ansible @ 4
  • Linux @ 6
  • DevOps @ 4
  • Python @ 4
  • Communication @ 4
  • Networking @ 4
  • GPU @ 4

Details

NVIDIA is seeking EngOps and Platform Engineers to develop and maintain software and automation that facilitate GPU communication and manage large GPU clusters interconnected via NVLink and InfiniBand. The role focuses on improving execution efficiency, maintaining cluster availability and performance, and collaborating with engineering and product teams across time zones.

Responsibilities

  • Develop automated tools to deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand.
  • Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability.
  • Take ownership of daily cluster failures and issues; troubleshoot promptly to maintain optimal availability and performance.
  • Manage rollout and rollback of cluster software and firmware updates to ensure smooth transitions with minimal disruption.
  • Collaborate with Engineering and Product teams across multiple time zones to align cluster operations with project requirements.

Requirements

  • BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.
  • 5+ years of hands-on experience deploying and administering clusters, servers, switches, and related infrastructure.
  • Automation expert with hands-on skills in Ansible, Python, and shell scripting.
  • Deep understanding of operating systems, computer networks, and high-performance applications.
  • Proficiency with Linux fundamentals.
  • Proven ability to work effectively with developers and test engineers across different teams and time zones.

Preferred / Ways to stand out

  • Familiarity with resource scheduling managers, preferably Slurm.
  • Direct experience with industry-standard alerting tools and emergency response practices.
  • Hands-on experience with GPU-focused hardware and software (for example, DGX systems and compute clusters).
  • Proficiency in designing and implementing a robust metrics collection and alerting infrastructure.
  • Experience designing large-scale networking technologies and addressing associated challenges.

Compensation & Benefits

  • Base salary ranges by level:
    • Level 3: 144,000 USD - 230,000 USD
    • Level 4: 168,000 USD - 270,250 USD
  • Eligibility for equity and company benefits (see NVIDIA benefits page).

Additional Information

  • Applications are accepted at least until August 13, 2025.
  • NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.