Senior System Software Engineer - AI Performance and Efficiency Tools
at NVIDIA
Santa Clara, United States
USD 184,000-356,500 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Software Development @ 6, Kubernetes @ 4, Linux @ 4, Python @ 7, TensorFlow @ 4, Communication @ 7, Networking @ 4, Debugging @ 4, PyTorch @ 4, CUDA @ 4
Details
A key part of NVIDIA's strength is our sophisticated analysis and debugging tools, which empower NVIDIA engineers to improve the performance and power efficiency of our products and the applications that run on them. We are looking for forward-thinking, hard-working, and creative people to join a multifaceted software team with high standards! This software engineering role involves developing tools for AI researchers and software/hardware teams running AI workloads on GPU clusters.
Responsibilities
- Build internal profiling and analysis tools for AI workloads at large scale
- Build debugging tools for commonly encountered problems, such as memory or networking issues
- Create benchmarking and simulation technologies for AI systems or GPU clusters
- Partner with hardware architects to propose new features or improve existing features with real-world use cases
Requirements
- BS or higher in Computer Science or a related field (or equivalent experience) and 5+ years of software development experience
- Strong software skills in design, coding (C++ and Python), analysis, and debugging
- Good understanding of deep learning frameworks such as PyTorch and TensorFlow, and of distributed training and inference
- Knowledge of GPU cluster job scheduling (Slurm or Kubernetes), storage and networking
- Experience with NVIDIA GPUs, CUDA programming, and NCCL
- Motivated self-starter with strong problem-solving skills and customer-facing communication skills
- Passion for continuous learning. Ability to work concurrently with multiple global groups
Ways to stand out from the crowd:
- Proven experience with continuous profiling and analysis tools or platforms at GPU-cluster scale
- Solid experience in large AI job performance analysis for training/inference workloads
- Knowledge of Linux device drivers and/or compiler implementation
- Knowledge of GPU and/or CPU architecture and general computer architecture principles
Benefits
You will also be eligible for equity and benefits. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.