Senior Manager, CSP Engagements – System Software SWAT Team

at Nvidia

📍 Santa Clara, United States

USD 272,000-425,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Linux (4), Hiring (4), Leadership (4), Networking (4), Debugging (4), CUDA (3), GPU (3)

Details

NVIDIA is seeking a Senior Manager to lead our System Software SWAT Team within CSP Engagements, focusing on data center platforms such as GB200/GB300 and next-generation systems. This elite, cross-functional group is the rapid-response hub for hyperscaler customers—running triage and war-rooms, operating customer-like labs to deliver golden repros, and driving issues from first signal to validated fix across firmware, Linux kernel / device drivers, networking, and virtualization. You will build and mentor the team, partner closely with CSP technical leaders and TPMs, and turn complex, high-visibility escalations into predictable, customer-validated outcomes that raise NVIDIA’s quality bar at hyperscale.

Responsibilities

Lead a cross-functional SWAT team focused on rapid triage, debugging, and resolution of complex system software issues for hyperscaler customers.
Drive technical incident response, war-room operations, and escalation management across firmware, Linux kernel, drivers, networking, virtualization, and observability layers.
Build and mentor a high-performing team of senior engineers; set operational standards for incident response, on-call rotations, and continuous improvement.
Serve as a primary technical and operational focal point for hyperscaler customers, managing expectations, communications, and stakeholder relationships.
Collaborate with CSP technical leads, TPMs, and internal engineering teams to deliver customer-validated solutions and influence product quality and release criteria.
Operate customer-like labs to reproduce issues, validate fixes, and ensure robust telemetry and observability.
Provide executive-level status updates, risk assessments, and recommendations for critical customer issues.

Requirements

12+ years of overall experience in system software (firmware, Linux kernel, drivers, networking, virtualization), with at least 5 years in data center or HPC software environments.
Bachelor’s degree or equivalent experience.
3+ years of direct experience working with hyperscalers in production environments.
6+ years of experience in management.
Proven leadership in managing customer escalations, technical incident response, and cross-functional teams.
Deep technical expertise in Linux kernel, device drivers, ARM (aarch64) & x86, OpenBMC/SBIOS, out-of-band/in-band management, DMTF protocols (Redfish, PLDM, MCTP, SPDM), and networking (TCP/IP, Ethernet, InfiniBand).
Strong customer management and team member engagement skills; ability to communicate complex technical issues to executive and engineering audiences.
Demonstrated success in reducing time-to-mitigation, improving release predictability, and driving continuous improvement in technical operations.

Ways to Stand Out / Preferred

Experience building and operating customer-like labs, automation, and telemetry frameworks.
Familiarity with GPU computing (CUDA), large-scale AI/HPC workloads, NVLink, Grace, and cluster-level deployment/management.
Knowledge of CXL/memory fabric fundamentals and contributions to industry standards (OCP, DMTF).

Compensation & Benefits

Base salary range: 272,000 USD - 425,500 USD (final base determined by location, experience, and pay of employees in similar positions).
Eligible for equity and benefits (link to NVIDIA benefits provided in original posting).

Additional Information

Applications are accepted until at least December 17, 2025.
NVIDIA is an equal opportunity employer and values diversity in hiring and promotion practices.