Senior AI Networking Performance Research And Analysis Engineer
Required Skills & Competences
Linux @ 4, Python @ 4, Algorithms @ 6, Distributed Systems @ 4, TensorFlow @ 4, Bash @ 4, Communication @ 4, Networking @ 4, LLM @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4
Details
Intelligent machines powered by Artificial Intelligence computers that can learn, reason and interact with people are no longer science fiction. GPU Deep Learning provides the foundation for machines to learn, perceive, reason and solve problems. Visual computing is crucial in helping people get along with technology, and NVIDIA extends this technology into datacenters, mobile devices, and cars.
Responsibilities
- Explore and research AI workloads and deep learning (DL) models tailored for large-scale LLM training on NVIDIA supercomputers and distributed systems, focusing on high-performance networking and the NVIDIA Collective Communications Library (NCCL).
- Benchmark, profile, and analyze performance to identify bottlenecks and areas for optimization, emphasizing networking aspects.
- Implement performance analysis tools.
- Collaborate with hardware and software teams to provide performance analysis insights.
- Define performance test plans, set performance expectations for new technologies, and work to reach performance targets.
Requirements
- B.Sc. in Computer Science, Software Engineering, or equivalent experience.
- 5+ years of experience with high-performance networking (RDMA, MPI, NCCL, Congestion Control Algorithms).
- Demonstrated performance analysis skills and methodologies.
- Experience with NVIDIA GPUs, CUDA library, deep learning frameworks such as TensorFlow or PyTorch, and expertise in networking collective communication libraries (like NCCL) and protocols (e.g., RoCE and RDMA).
- Strong analytical and problem-solving skills with fast self-learning capability.
- Programming skills in Python, Bash, and C.
- Experience with Linux OS distributions.
- Strong communication and interpersonal skills.
Ways to Stand Out
- In-depth knowledge and experience with AI workloads and benchmarking for distributed LLM training.
- Knowledge of CUDA and NCCL libraries.
- Understanding of congestion control algorithms.
- Deep system knowledge across CPUs (Intel, AMD, ARM), NVIDIA GPUs, HCAs, memory, PCI.
- Expertise in performance analysis using modern tools.
About NVIDIA
NVIDIA has a legacy of innovation in computer graphics, PC gaming, and accelerated computing for over 25 years. The company is leveraging AI to define the next computing era where GPUs act as the brains of computers, robots, and self-driving cars. NVIDIA offers competitive salaries and comprehensive benefits in a diverse, supportive work environment.
#LI-Hybrid