Senior System Software Engineer - AI Performance And Efficiency Tools

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ Hybrid

Required Skills & Competences

  • Software Development @ 6
  • Kubernetes @ 4
  • Linux @ 4
  • Python @ 7
  • TensorFlow @ 4
  • Communication @ 7
  • Networking @ 4
  • Debugging @ 4
  • PyTorch @ 4
  • CUDA @ 4
  • GPU @ 4

Details

A key part of NVIDIA's strength is our sophisticated analysis and debugging tools, which empower engineers to improve the performance and power efficiency of our products and the applications running on them. This role involves developing tools for AI researchers and for software and hardware teams running AI workloads on GPU clusters. You will collaborate with Architecture and Software teams to deliver intuitive, rich, and accurate insights into workloads and systems, find software and hardware optimization opportunities, build high-level models, and debug tricky failures to improve system performance and efficiency.

Responsibilities

  • Build internal profiling and analysis tools for AI workloads at large scale
  • Build debugging tools for commonly encountered problems such as memory or networking issues
  • Create benchmarking and simulation technologies for AI systems or GPU clusters
  • Partner with hardware architects to propose new features or improve existing features using real-world use cases
  • Work with multiple global groups and be a customer-facing problem solver

Requirements

  • BS or higher in Computer Science or a related field (or equivalent experience) and 5+ years of software development experience
  • Strong software skills in design and coding (C++ and Python), plus analytical and debugging abilities
  • Good understanding of deep learning frameworks such as PyTorch and TensorFlow, including distributed training and inference
  • Knowledge of GPU cluster job scheduling (Slurm or Kubernetes), storage, and networking
  • Experience with NVIDIA GPUs, CUDA programming, and NCCL
  • Motivated self-starter with strong problem-solving skills and customer-facing communication skills
  • Passion for continuous learning and ability to work concurrently with multiple global groups

Ways to stand out

  • Proven experience in GPU cluster-scale continuous profiling and analysis tools/platforms
  • Solid experience in large AI job performance analysis for training/inference workloads
  • Knowledge of Linux device drivers and/or compiler implementation
  • Knowledge of GPU and/or CPU architecture and general computer architecture principles

Benefits & Compensation

  • Base salary range (Location & level dependent):
    • Level 4: 184,000 USD - 287,500 USD
    • Level 5: 224,000 USD - 356,500 USD
  • Eligible for equity and company benefits (see NVIDIA benefits)

Additional information

  • Location: Santa Clara, California, United States
  • Office policy: Hybrid (#LI-Hybrid)
  • Applications accepted at least until July 29, 2025
  • Employer: NVIDIA (equal opportunity employer)