Senior Manager, CSP Engagements – System Software SWAT Team
at NVIDIA
USD 272,000-425,500 per year
Required Skills & Competences
Linux (4), Hiring (4), Leadership (4), Networking (4), Debugging (4), CUDA (3), GPU (3)
Details
NVIDIA is seeking a Senior Manager to lead our System Software SWAT Team within CSP Engagements, focusing on data center platforms such as GB200/GB300 and next-generation systems. This elite, cross-functional group is the rapid-response hub for hyperscaler customers—running triage and war-rooms, operating customer-like labs to deliver golden repros, and driving issues from first signal to validated fix across firmware, Linux kernel / device drivers, networking, and virtualization. You will build and mentor the team, partner closely with CSP technical leaders and TPMs, and turn complex, high-visibility escalations into predictable, customer-validated outcomes that raise NVIDIA’s quality bar at hyperscale.
Responsibilities
- Lead a cross-functional SWAT team focused on rapid triage, debugging, and resolution of complex system software issues for hyperscaler customers.
- Drive technical incident response, war-room operations, and escalation management across firmware, Linux kernel, drivers, networking, virtualization, and observability layers.
- Build and mentor a high-performing team of senior engineers; set operational standards for incident response, on-call rotations, and continuous improvement.
- Serve as a primary technical and operational focal point for hyperscaler customers, managing expectations, communications, and stakeholder relationships.
- Collaborate with CSP technical leads, TPMs, and internal engineering teams to deliver customer-validated solutions and influence product quality and release criteria.
- Operate customer-like labs to reproduce issues, validate fixes, and ensure robust telemetry and observability.
- Provide executive-level status updates, risk assessments, and recommendations for critical customer issues.
Requirements
- 12+ years of overall experience in system software (firmware, Linux kernel, drivers, networking, virtualization), including at least 5 years in data center or HPC software environments.
- Bachelor’s degree or equivalent experience.
- At least 3 years of direct experience working with hyperscalers in production environments.
- 6+ years of management experience.
- Proven leadership in managing customer escalations, technical incident response, and cross-functional teams.
- Deep technical expertise in Linux kernel, device drivers, ARM (aarch64) & x86, OpenBMC/SBIOS, out-of-band/in-band management, DMTF protocols (Redfish, PLDM, MCTP, SPDM), and networking (TCP/IP, Ethernet, InfiniBand).
- Strong customer management and team member engagement skills; ability to communicate complex technical issues to executive and engineering audiences.
- Demonstrated success in reducing time-to-mitigation, improving release predictability, and driving continuous improvement in technical operations.
Ways to Stand Out / Preferred
- Experience building and operating customer-like labs, automation, and telemetry frameworks.
- Familiarity with GPU computing (CUDA), large-scale AI/HPC workloads, NVLink, Grace, and cluster-level deployment/management.
- Knowledge of CXL/memory fabric fundamentals and contributions to industry standards (OCP, DMTF).
Compensation & Benefits
- Base salary range: 272,000 USD - 425,500 USD (final base determined by location, experience, and pay of employees in similar positions).
- Eligible for equity and benefits.
Additional Information
- Applications accepted at least until December 17, 2025.
- NVIDIA is an equal opportunity employer and values diversity in hiring and promotion practices.