Senior Platform and EngOps Engineer - Cluster Operations
at NVIDIA
Santa Clara, United States
USD 176,000-333,500 per year
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and you know the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Ansible @ 4
Linux @ 6
DevOps @ 4
Python @ 4
Hiring @ 4
Communication @ 4
Networking @ 4
GPU @ 4
Deep Learning @ 4
AI @ 4
InfiniBand @ 4
Slurm @ 3
Details
NVIDIA is hiring EngOps and Platform Engineers to develop and maintain software that facilitates GPU communication and to manage large GPU clusters interconnected via NVLink and InfiniBand. The role focuses on improving execution efficiency, automating cluster operations, and ensuring high availability and performance for high-performance computing and deep learning workloads.
Responsibilities
- Develop automated tools to deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand.
- Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability.
- Own and troubleshoot daily cluster failures and issues to maintain optimal cluster availability and performance.
- Manage rollout and rollback of cluster software and firmware updates, ensuring minimal disruption.
- Collaborate with Engineering and Product teams across multiple time zones to align cluster operations with project requirements.
Requirements
- BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.
- 8+ years of hands-on experience deploying and administering clusters, servers, switches, and related infrastructure.
- Automation expertise with hands-on skills in Ansible, Python, and shell scripting.
- Deep understanding of operating systems, computer networks, and high-performance applications.
- Proven ability to work effectively with developers and test engineers across teams and time zones.
- Proficient with Linux fundamentals.
Ways to stand out
- Familiarity with resource scheduling managers (preferably Slurm).
- Experience with industry-standard alerting tools and emergency response practices.
- Hands-on experience with GPU-focused hardware and software (e.g., NVIDIA DGX systems and compute clusters).
- Experience designing metrics collection and alerting infrastructure.
- Experience designing large-scale networking technologies and addressing associated challenges.
Compensation & Benefits
- Base salary ranges (determined by location, experience, and internal pay benchmarks):
  - Level 4: 176,000 USD - 276,000 USD
  - Level 5: 208,000 USD - 333,500 USD
- Eligible for equity and other benefits (see company benefits page).
Additional information
- Applications accepted at least until March 20, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer committed to diversity and inclusion.