Software Manager, AI Infrastructure System

at Nvidia
USD 224,000-425,500 per year
Mid-level
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Software Development (7), Distributed Systems (3), Communication (3), Debugging (3), API (2), LLM (3), GPU (3)

Details

NVIDIA is seeking an AI Infrastructure System Software Manager to advance the HPC and AI infrastructure that powers research and production LLM-based solutions. You will lead a team that builds and operates sophisticated infrastructure and tools for business-critical services and AI applications, collaborating across research and infrastructure teams to serve a broad engineering user base.

Responsibilities

  • Mentor, grow, and develop a world-class team of AI infrastructure engineers.
  • Work across multiple teams and orgs to build products that use LLMs and agent systems to serve NVIDIA engineering teams (hardware and software).
  • Align priorities across collaborators and define metrics to measure product and team success.
  • Develop and execute strategies for scalable, reliable, and secure AI infrastructure supporting research and production workloads.
  • Ensure robust monitoring, logging, visualization, and alerting to guarantee uptime and operational excellence.
  • Architect, design, develop, and maintain infrastructure and large-scale applications for LLM-based solutions, optimizing for performance, scalability, reliability, and secure data management.
  • Stay current with AI/ML and infrastructure trends and integrate advancements into NVIDIA's LLM and AI infrastructure solutions.

Requirements

  • 10+ years of industry experience in large distributed system software development.
  • BS or higher in Computer Science or equivalent experience.
  • 5+ years of experience managing AI and software development teams.
  • Familiarity with modern software development stacks and tools, including containerization, cloud or on-premises deployments, API integration, and real-time processing frameworks.
  • Experience developing and maintaining LLM or Generative AI infrastructure.
  • Hands-on experience developing large-scale distributed systems.
  • Excellent communication, collaboration, and problem-solving skills; commitment to inclusive and diverse workplaces.

Ways to stand out

  • Strong technical background in cloud/distributed infrastructure.
  • Experience debugging functional and performance issues in HPC GPU clusters.
  • Background in running and instrumenting distributed LLM training on multi-GPU HPC clusters.
  • Experience with HPC schedulers such as Slurm.

About NVIDIA

NVIDIA has a long history of GPU innovation and is a leader in GPU deep learning, modern AI, and HPC. The company emphasizes tackling hard problems that matter and offers competitive salaries and comprehensive benefits. NVIDIA is an equal opportunity employer committed to diversity and inclusion.