Senior Software Architect - Deep Learning And HPC Communications
Required Skills & Competences
- Linux (7)
- Algorithms (4)
- TensorFlow (4)
- Communication (4)
- Networking (4)
- Parallel Programming (4)
- Debugging (4)
- System Architecture (7)
- PyTorch (4)
- CUDA (4)
- GPU (4)
Details
NVIDIA is leading groundbreaking developments in Artificial Intelligence, High Performance Computing, and Visualization. The GPU, NVIDIA's invention, serves as the visual cortex of modern computers and is central to NVIDIA's products and services. The work enables new creative and discovery possibilities and powers advanced inventions like AI for autonomous cars.
GPU communication libraries such as NCCL, NVSHMEM, and GPUDirect are crucial for scaling Deep Learning and HPC applications. This role is for a Senior Software Architect who will co-design next-generation data center platforms and scalable communications software.
Deep Learning (DL) and HPC applications demand significant compute resources and run at scales of tens of thousands of GPUs. GPUs are connected by high-speed interconnects within a node and by networks across nodes. Efficient communication between GPUs is critical for application performance, and its importance grows with system scale.
Responsibilities
- Investigate communication performance bottlenecks in current systems.
- Design and implement new communication technologies to accelerate AI and HPC workloads.
- Explore innovative hardware and software solutions for next-gen platforms in collaboration with GPU, Networking, and Software architects.
- Build proofs-of-concept, conduct experiments, and perform quantitative modeling to drive innovations.
- Use simulation to evaluate performance of large GPU clusters (hundreds of thousands of GPUs).
Requirements
- M.S. or Ph.D. in Computer Science or Computer Engineering, or equivalent experience.
- 5+ years of relevant professional experience.
- Excellent programming and debugging skills in C/C++.
- Experience with parallel programming models such as MPI and SHMEM, and with communication runtimes such as MPI, NCCL, NVSHMEM, OpenSHMEM, UCX, and UCC.
- Deep understanding of operating systems and system architecture.
- Solid fundamentals of network architecture, topology, algorithms, and communication scaling relevant to AI and HPC.
- Strong experience with Linux.
- Ability to work effectively in a multinational, multi-time-zone corporate environment.
Preferred Qualifications
- Expertise in, and passion for, related technologies.
- Experience with CUDA programming and NVIDIA GPUs.
- Knowledge of high-performance networks like InfiniBand, RoCE, and NVLink.
- Experience with Deep Learning frameworks such as PyTorch and TensorFlow.
- Knowledge of deep learning parallelism and its mapping to communication subsystems.
- Experience with HPC applications.
- Strong collaborative and interpersonal skills, with proven ability to influence in dynamic environments.
NVIDIA offers highly competitive salaries, comprehensive benefits, and a diverse and inclusive work environment.