Senior AI Networking Performance Research And Analysis Engineer

at NVIDIA
πŸ“ Shanghai, China
USD 100,000-160,000 per year
SENIOR
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

Linux @ 4, Python @ 4, Algorithms @ 6, Distributed Systems @ 4, TensorFlow @ 4, Bash @ 4, Communication @ 4, Networking @ 4, LLM @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4

Details

Intelligent machines powered by Artificial Intelligence computers that can learn, reason and interact with people are no longer science fiction. GPU Deep Learning provides the foundation for machines to learn, perceive, reason and solve problems. Visual computing is crucial in helping people get along with technology, and NVIDIA extends this technology into datacenters, mobile devices, and cars.

Responsibilities

  • Explore and research AI workloads and deep learning (DL) models tailored for large-scale LLM training on NVIDIA supercomputers and distributed systems, focusing on high-performance networking and the NVIDIA Collective Communications Library (NCCL).
  • Benchmark, profile, and analyze performance to identify bottlenecks and areas for optimization, emphasizing networking aspects.
  • Implement performance analysis tools.
  • Collaborate with hardware and software teams to provide performance analysis insights.
  • Define performance test plans, set performance expectations for new technologies, and work to meet performance targets.

Requirements

  • B.Sc. in Computer Science, Software Engineering, or equivalent experience.
  • 5+ years of experience with high-performance networking (RDMA, MPI, NCCL, congestion control algorithms).
  • Demonstrated performance analysis skills and methodologies.
  • Experience with NVIDIA GPUs, CUDA library, deep learning frameworks such as TensorFlow or PyTorch, and expertise in networking collective communication libraries (like NCCL) and protocols (e.g., RoCE and RDMA).
  • Strong analytical and problem-solving skills with fast self-learning capability.
  • Programming skills in Python, Bash, and C.
  • Experience with Linux OS distributions.
  • Strong communication and interpersonal skills.

Ways to Stand Out

  • In-depth knowledge and experience with AI workloads and benchmarking for distributed LLM training.
  • Knowledge of CUDA and NCCL libraries.
  • Understanding of congestion control algorithms.
  • Deep system knowledge across CPUs (Intel, AMD, ARM), NVIDIA GPUs, HCAs, memory, PCI.
  • Expertise in performance analysis using modern tools.

About NVIDIA

NVIDIA has a legacy of innovation in computer graphics, PC gaming, and accelerated computing for over 25 years. The company is leveraging AI to define the next computing era where GPUs act as the brains of computers, robots, and self-driving cars. NVIDIA offers competitive salaries and comprehensive benefits in a diverse, supportive work environment.
