Required Skills & Competences
Ansible @ 7, Grafana @ 3, Jenkins @ 2, Kubernetes @ 3, Linux @ 7, Prometheus @ 3, Python @ 5, CI/CD @ 2, Hiring @ 3, Leadership @ 3, Bash @ 5, Communication @ 3, Git @ 2, Networking @ 5, Salt @ 7, GPU @ 3
Details
NVIDIA is hiring a Base Command Manager (BCM) Engineer to support product deployments, escalations, and collaboration between Engineering and the Field Organization for NVIS NPI. The role focuses on cluster deployment and management solutions for large-scale GPU platforms, ensuring successful new product introduction (NPI) and global customer/OEM deployments.
Responsibilities
- Act as the link between engineering and the NVIS field team for cluster deployment and management solutions.
- Collaborate with engineering and product teams to review and influence design decisions for products centered around BCM-managed clusters.
- Evaluate changes in BCM and underlying OS/software stacks and communicate impacts to the field organization to maintain scalable deployment workflows.
- Define and relay detailed cluster management requirements to engineering to support NPI of next-generation GPU platforms.
- Describe architectural and design changes and build clear, actionable tasks for the field, including standardized deployment guides, configuration methodologies, and validation workflows.
- Validate complex cluster configurations (including Slurm and Kubernetes orchestrators) for performance, scalability, and resilience against real-world customer scenarios.
- Support the NPI team by bridging knowledge gaps, tracking progress, and aligning collaborators throughout the product lifecycle.
- Ensure NVIDIA technologies are successfully deployed for global customers and OEM partners.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- 10+ years of experience in at least two of the following: HPC/large-scale cluster administration, Linux systems engineering, infrastructure automation (e.g., Ansible, Salt), or data center operations.
- 5+ years of direct, hands-on experience provisioning, managing, and troubleshooting clusters using NVIDIA Base Command Manager (BCM).
- Deep, practical knowledge of how Slurm and Kubernetes are coordinated, deployed, and managed by BCM, including workload submission and resource management.
- Proficiency in Python and Bash scripting for automation, cluster validation, and workflow optimization.
- In-depth experience with cluster management and monitoring tools (e.g., Prometheus, Grafana, DCGM, and similar observability stacks).
- Outstanding written and verbal communication skills; ability to explain complex technical concepts to both technical and non-technical collaborators.
- Customer-first attitude, self-motivation, and a proactive approach to leadership in diverse environments.
Ways to stand out
- Proficiency with cluster networking including InfiniBand and Spectrum-X.
- Experience with NVIDIA Mission Control.
- Familiarity with CI/CD workflows in an infrastructure context (Git, GitLab, Jenkins).
- Background in Professional Services, customer-facing deployment, and solutions optimization.
- Industry certifications such as CKA/CKAD, RHCE, or other advanced Linux/HPC credentials.
Compensation & Benefits
- Base salary ranges by level (determined by location, experience, and pay of employees in similar positions):
  - Level 5: 176,000 USD - 276,000 USD
  - Level 6: 208,000 USD - 327,750 USD
- You will also be eligible for equity and benefits (see NVIDIA benefits).
- Applications for this job will be accepted at least until August 26, 2025.
About NVIDIA
NVIDIA has transformed computer graphics, PC gaming, and accelerated computing for more than 25 years and is now driving the next era of computing with AI. NVIDIA values diversity and is an equal opportunity employer.