Required Skills & Competences
- Software Development @ 7
- System Administration @ 4
- Kubernetes @ 4
- Linux @ 4
- Python @ 6
- Networking @ 4
- GPU @ 4
Details
NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for over 25 years. Today NVIDIA is focused on AI and accelerated computing. Base Command Manager (BCM) powers thousands of clusters worldwide and streamlines cluster provisioning, workload management, and infrastructure monitoring. This role will support external customers running clusters with NVIDIA solutions as well as internal clusters used for research, operations, and next-generation projects.
Responsibilities
- Contribute to deployments and daily operations of large-scale next-generation GPU platforms.
- Handle incidents in GPU clusters, bridging the gap between cluster operations and development.
- Design and implement small features in the Base Command Manager product to become intimately familiar with the product internals.
- Validate complex cluster configurations (including Slurm and Kubernetes orchestrators) for performance, scalability, and resilience to meet real-world customer scenarios.
Requirements
- Bachelor’s Degree or equivalent experience in Computer Science or a related field.
- 8+ years of experience in site reliability engineering and/or software development roles.
- Fluency in Python.
- In-depth knowledge of Linux and networking.
Ways to stand out
- Experience with C++.
- Experience with high-performance computing (HPC).
- Experience with Kubernetes and/or system administration.
- Previous experience running Base Command Manager (BCM, formerly Bright Cluster Manager) clusters.
- Proficiency with cluster networking including InfiniBand and Spectrum-X.
Technologies and concepts mentioned
Python, Linux, networking, Slurm, Kubernetes, C++, high-performance computing (HPC), Bright Cluster Manager / Base Command Manager, GPU clusters, DGX Cloud, cluster provisioning, workload management, infrastructure monitoring, InfiniBand, Spectrum-X.
Compensation & benefits
- Base salary ranges provided in the posting:
  - Level 4: 168,000 USD - 270,250 USD per year
  - Level 5: 208,000 USD - 333,500 USD per year
- You will also be eligible for equity and additional benefits (link to NVIDIA benefits provided in the original posting).
Other information
- Applications for this job will be accepted at least until December 20, 2025.
- NVIDIA is an equal opportunity employer and promotes a diverse work environment.