Senior Engineering Manager – AI Research Clusters

at NVIDIA
USD 272,000-425,500 per year
SENIOR
✅ On-site

Required Skills & Competences

R @ 4, Leadership @ 6, Networking @ 4, GPU @ 4

Details

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today, NVIDIA is tapping into the unlimited potential of AI to define the next era of computing, where GPUs act as the brains of computers, robots, and self-driving cars that understand the world. This role offers an opportunity to make a lasting impact by leading ambitious projects that demonstrate world-class GPU technology to drive groundbreaking AI R&D.

Responsibilities

  • Lead the design and deployment of scalable storage systems optimized for AI workloads and high-performance compute clusters.
  • Drive readiness and operational enablement for upcoming hardware platforms, ensuring seamless integration and performance.
  • Coordinate development of internal tools for storage provisioning, usage traceability, and user self-service.
  • Guide evaluation and implementation of technologies for improving efficiency, reliability, and observability.
  • Collaborate with cross-functional teams to align storage architecture with GPU cluster and research needs.
  • Improve storage monitoring and metrics infrastructure to enable proactive management.
  • Modernize existing storage systems with improved quota management, compression, and automation.

Requirements

  • BS degree or equivalent experience.
  • 12+ years of relevant technical experience.
  • 5+ years of leadership experience.
  • Proven ability to lead engineering teams building infrastructure at scale, particularly infrastructure that combines storage and HPC.
  • Deep knowledge of distributed storage systems with experience in data access patterns and platform observability.
  • Familiarity with infrastructure deployment lifecycle from planning to operational readiness.
  • Strong understanding of how to align storage performance with compute needs and how to measure system behavior with real metrics.
  • Ability to guide teams through technical evaluations that balance rigor and pragmatism.

Ways to Stand Out

  • Experience with large-scale storage and networking in performance-sensitive environments like HPC, AI, or scientific computing.
  • Success building tools or automation for self-service, visibility, and governance in complex infrastructure.
  • Background in data observability and metrics correlation for performance, cost efficiency, or capacity forecasting.
  • Experience leading cross-functional technical evaluations or RFPs that resulted in infrastructure deployments.
  • Contributions to storage architecture such as filesystem tuning, quota management, or data compression.

Benefits

  • Competitive pay and benefits.
  • Equity eligibility.
  • Commitment to diversity and equal opportunity employment.