Senior AI Infrastructure Engineer

at NVIDIA

📍 Santa Clara, United States

$140,000-$258,800 per year

SENIOR
✅ On-site

Required Skills & Competences

  • Software Development @ 4
  • Ansible @ 4
  • Docker @ 4
  • Linux @ 4
  • DevOps @ 4
  • Python @ 4
  • Bash @ 4
  • Performance Optimization @ 4

Details

We are now seeking a Senior AI Infrastructure Engineer! NVIDIA’s Compute Architecture Group is growing our team of AI-focused infrastructure engineers, who run our internal cluster for accelerated AI and software development. As part of this team, you will help manage a diverse cluster of GPU-accelerated systems. Your contributions will enable engineers to work efficiently with a wide variety of forward-looking hardware configurations as they vigilantly seek out opportunities for performance optimization and continuously deliver high-quality software.

Responsibilities

  • Administer an internal NVIDIA AI cluster composed of Linux systems ranging from the world’s most powerful servers to embedded systems.
  • Maintain the configuration of our resource management system (Slurm) to keep resource allocation efficient and aligned with organizational priorities.
  • Automate configuration management, software updates, and maintenance of system availability using modern DevOps tools (Ansible, GitLab, etc.).
  • Plan and maintain new systems that support the NVIDIA software stack.
  • Work directly with developers and hardware architects to debug issues, identify new requirements, and improve workflows.
  • Actively communicate with users and management regarding resource planning and allocation.

Requirements

  • 5+ years of experience deploying and administering large-scale clusters tuned for AI development.
  • MS in Computer Science, Computer Engineering, or EECE; or a BS (or equivalent experience).
  • Deep knowledge of distributed resource scheduling systems (Slurm preferred, LSF, etc.).
  • Demonstrated ability to script in Bash and at least one high-level language (Python preferred).
  • Experience with container technologies (Docker, Singularity, etc.).
  • Deep understanding of operating systems, computer networks, and high-performance hardware.
  • Ability to work well with developers, hardware architects, and test engineers.
  • Passionate dedication to providing quality support for users.

Benefits

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.