Senior Deep Learning Systems Engineer, Datacenters

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ Hybrid

Required Skills & Competences

Software Development @ 4, Docker @ 4, Linux @ 4, Python @ 4, TensorFlow @ 4, Bash @ 4, Networking @ 4, Performance Monitoring @ 4, System Architecture @ 7, PyTorch @ 4, CUDA @ 4, GPU @ 4

Details

As NVIDIA makes inroads into the Datacenter business, this team is focused on maximizing performance and power efficiency of deep learning applications on datacenter-class hardware and establishing data-driven approaches to hardware design and system software development.

Responsibilities

  • Develop software infrastructure to characterize and analyze a broad range of Deep Learning applications.
  • Evolve cost-efficient datacenter architectures tailored to meet the needs of Large Language Models (LLMs).
  • Work with experts to develop analysis and profiling tools in Python, Bash, and C++ that measure key performance metrics of DL workloads running on NVIDIA systems.
  • Analyze system and software characteristics of DL applications (CPU, GPU, networking, IO interactions with DL workloads).
  • Develop analysis tools and methodologies to measure key performance metrics and estimate potential for efficiency improvement.

Requirements

  • Bachelor’s degree in Electrical Engineering or Computer Science or equivalent experience (Master's or PhD preferred).
  • 8 years or more of relevant experience.
  • Experience in at least one of:
    • System Software: Operating Systems (Linux), Compilers, GPU kernels (CUDA), DL Frameworks (PyTorch, TensorFlow).
    • Silicon Architecture and Performance Modeling/Analysis: CPU, GPU, Memory or Network Architecture.
  • Programming experience in C/C++ and Python. Exposure to Bash scripting.
  • Exposure to containerization platforms (Docker) and datacenter workload managers (Slurm) is a plus.
  • Deep understanding of computer system architecture and performance analysis with demonstrated hands-on experience.
  • Demonstrated ability to work in virtual/multi-site environments and to take ownership from start to finish.

Ways to stand out

  • Background with system software, OS intrinsics, GPU kernels (CUDA), or DL frameworks (PyTorch, TensorFlow).
  • Experience with silicon performance monitoring or profiling tools (e.g., perf, gprof, nvidia-smi, dcgm).
  • In-depth performance modeling experience in CPU, GPU, Memory or Network Architecture.
  • Exposure to containerization platforms (Docker) and datacenter workload managers (Slurm).
  • Prior experience working with multi-site or cross-functional teams.

Compensation & Additional Info

  • Base salary ranges (determined by location, experience, and comparable roles):
    • Level 4: 184,000 USD - 287,500 USD
    • Level 5: 224,000 USD - 356,500 USD
  • Eligible for equity and benefits.
  • Applications accepted at least until July 29, 2025.

Benefits

  • NVIDIA benefits (details available on NVIDIA website).