Senior System Software Engineer - AI Performance and Efficiency Tools

at Nvidia
USD 184,000-356,500 per year
SENIOR
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

  • Software Development @ 6
  • Kubernetes @ 4
  • Linux @ 4
  • Python @ 7
  • TensorFlow @ 4
  • Communication @ 7
  • Networking @ 4
  • Debugging @ 4
  • PyTorch @ 4
  • CUDA @ 4

Details

A key part of NVIDIA's strength is our sophisticated analysis and debugging tools, which empower NVIDIA engineers to improve the performance and power efficiency of our products and the applications running on them. We are looking for forward-thinking, hard-working, and creative people to join a multifaceted software team with high standards! This software engineering role involves developing tools for AI researchers and for software and hardware teams running AI workloads on GPU clusters.

Responsibilities

  • Build internal profiling and analysis tools for AI workloads at large scale
  • Build debugging tools for commonly encountered problems, such as memory or networking issues
  • Create benchmarking and simulation technologies for AI systems or GPU clusters
  • Partner with hardware architects to propose new features or improve existing features with real-world use cases

Requirements

  • BS+ in Computer Science or a related field (or equivalent experience) and 5+ years of software development
  • Strong software design, coding (C++ and Python), analysis, and debugging skills
  • Good understanding of deep learning frameworks such as PyTorch and TensorFlow, and of distributed training and inference
  • Knowledge of GPU cluster job scheduling (Slurm or Kubernetes), storage and networking
  • Experience with NVIDIA GPUs, CUDA Programming, and NCCL
  • Motivated self-starter with strong problem-solving skills and customer-facing communication skills
  • Passion for continuous learning. Ability to work concurrently with multiple global groups

Ways to stand out from the crowd:

  • Proven experience in GPU cluster scale continuous profiling & analysis tools/platforms
  • Solid experience in large AI job performance analysis for training/inference workloads
  • Knowledge of Linux device drivers and/or compiler implementation
  • Knowledge of GPU and/or CPU architecture and general computer architecture principles

Benefits

You will also be eligible for equity and benefits. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.