Used Tools & Technologies
Not specified
Required Skills & Competences
- Software Development @ 6
- Distributed Systems @ 6
- Communication @ 3
- Debugging @ 3
- API @ 2
- LLM @ 3
- GPU @ 3
Details
NVIDIA is seeking an AI Infrastructure System Software Manager to join the team improving HPC and AI infrastructure. The team builds and operates sophisticated infrastructure that enables business-critical services and AI applications, and provides the tools to build and manage that infrastructure. The ideal candidate is strong in software development and in designing and building reliable distributed systems, and can define and implement long-term maintenance strategies.
Responsibilities
- Mentor, grow, and develop a world-class team of AI infrastructure engineers.
- Work across several teams and organizations to build products that use LLMs and agent systems for NVIDIA engineering teams, collaborating with research and infrastructure teams and serving a large user base of hardware and software teams across NVIDIA.
- Align priorities across collaborators and define metrics for measuring product/team success.
- Develop and execute strategies for scalable, reliable, and secure AI infrastructure supporting both research and production workloads.
- Ensure robust monitoring, logging, visualization, and alerting capabilities to guarantee uptime and operational excellence.
- Architect, design, develop, and maintain infrastructure and large-scale applications for LLM-based solutions, optimizing for performance, scalability, reliability, and secure data management.
- Stay updated with the latest trends in AI, ML, and infrastructure, and proactively seek opportunities to integrate advancements into NVIDIA's LLM and AI infrastructure solutions.
Requirements
- 10+ years of industry experience in large distributed system software development.
- BS or higher in Computer Science or related field, or equivalent experience.
- 5+ years of experience managing AI and software development teams.
- Familiarity with modern software stacks and tools, including containerization, cloud or on-premises deployments, API integration for model operation, and real-time processing frameworks.
- Experience developing and maintaining LLM or GenAI infrastructure.
- Hands-on experience developing large-scale distributed systems.
- Excellent communication, collaboration, and problem-solving skills, with dedication to an inclusive and diverse workplace.
Ways to stand out / Preferred
- Strong technical background in cloud/distributed infrastructure.
- Experience debugging functional and performance issues in HPC GPU clusters.
- Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster.
- Experience with HPC schedulers such as Slurm.
Compensation & Benefits
- Base salary ranges provided by location/level:
- Level 3: 224,000 USD - 356,500 USD
- Level 4: 272,000 USD - 425,500 USD
- Eligible for equity and benefits. See NVIDIA benefits pages for details.
Other
- Applications accepted at least until July 29, 2025.
- NVIDIA is an equal opportunity employer committed to diversity and inclusion.