Engineering Manager - Rack Scale AI Systems

at NVIDIA

📍 Santa Clara, United States

USD 168,000-333,500 per year

MIDDLE

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 6, Linux @ 3, Distributed Systems @ 6, Communication @ 5, Android @ 3, Product Management @ 3, QA @ 3, CUDA @ 3

Details

NVIDIA is seeking an Engineering Manager to lead the Cloud Platform Team within IPP (Infrastructure, Planning and Process), focused on Rack Scale AI Systems. The role oversees cloud services that run automated jobs across thousands of servers and support a heterogeneous mix of machines and devices (Windows/Linux/Android) and hardware platforms, including NVIDIA GPUs and Tegra processors. The team partners with multiple NVIDIA organizations (Graphics, Mobile, Deep Learning, AI, Driverless Cars) to meet infrastructure needs from concept to prototype to deployment.

Responsibilities

Build and lead an engineering organization focused on rack-scale systems onboarding and bring-up execution, engaging with internal and external partners.
Work with engineering, product management, and customer program management teams to define, prioritize, and implement features, infrastructure, processes, and workflows.
Work with NVIDIA product teams to understand new product requirements including HPC and AI/ML products.
Collaborate with cross-functional teams (system engineering, software engineering, mechanical/thermal engineering, operations, data center teams, external vendors) to deliver a reliable platform from concept to deployment.
Identify weaknesses in current processes and propose actions to improve quality.
Drive overall quality of deployments and improve time-to-market for next-generation products.
Lead on-ground teams in collecting data on SOL deployments, physical touch information, and failure patterns.
Drive triage and recovery execution during product bring-up and maintain support through the product sustaining phase.

Requirements

Bachelor's or Master's degree in Computer Science or Software Engineering, or equivalent experience.
8+ years of overall experience, including 5+ years of management experience in large, cross-matrix, geo-dispersed technology organizations focused on the server and data center space; strong experience in Operations Product Engineering.
Strong technical skills and understanding of embedded systems, orchestration & automation systems, data centers and cloud architecture.
Deep understanding of cloud design in areas such as virtualization, global infrastructure, distributed systems, load balancing, and security.
Excellent communication, planning, and collaborative interpersonal skills; proven ability to guide and influence diverse teams.
Strong analytical approach to identifying risks and developing robust mitigations.

Ways to stand out

Experience in large-scale QA environments for product bring-ups.
Experience with high-performance or large-scale computing environments, parallel computing, or CUDA.
Skills in large-scale/cluster computing (MPI), data center design including high-speed interconnects (InfiniBand), cluster storage, and scheduling-related design or management.
Experience with converged and hyper-converged hardware and servers.
Strong background in Windows and Linux administration.

Compensation & Benefits

Base salary ranges (determined by location, experience, and internal equity):
- Level 2: 168,000 USD - 270,250 USD
- Level 3: 208,000 USD - 333,500 USD
Eligible for equity and benefits (link to NVIDIA benefits provided in original posting).

Other details

Location: Santa Clara, CA, United States (posting lists US, CA, Santa Clara).
Time type: Full time.
Applications accepted at least until August 1, 2025.
NVIDIA is an equal opportunity employer and states a commitment to diversity and non-discrimination.