Distinguished Software Architect - Deep Learning and HPC Communications
Required Skills & Competences
Software Development @ 5, Go @ 3, Algorithms @ 6, TensorFlow @ 3, Communication @ 3, Networking @ 3, Parallel Programming @ 3, Debugging @ 6, System Architecture @ 6, PyTorch @ 3, CUDA @ 6, GPU @ 3
Details
NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars.
We are the GPU Communications Libraries and Networking team at NVIDIA. We deliver communication libraries like NCCL, NVSHMEM, UCX, and UCC for Deep Learning and HPC. We are looking for a Distinguished Software Architect to help co-design our next-generation data center platforms. DL and HPC applications already have huge compute demands and run at scales of up to tens of thousands of GPUs. The GPUs are connected with high-speed interconnects (e.g. NVLink, PCIe) within a node and with high-speed networking (e.g. InfiniBand, Ethernet) across nodes. Communication performance between the GPUs has a direct impact on end-to-end application performance, and the stakes are even higher at huge scales. This role offers an opportunity to push the limits of the state of the art and deliver next-gen platforms.
Responsibilities
- Research new communication technologies (for example, expand the GPUDirect technology portfolio) and design new features for communication libraries.
- Propose innovative HW and SW solutions for next-generation platforms and co-design them with GPU, Networking, and SW architects to ensure seamless integration with software stacks.
- Drive decisions based on quantitative data from proof-of-concepts, technical analysis, and modeling.
- Drive adoption of new communication technologies across application verticals.
- Keep up with the latest deep learning research and collaborate with internal and external teams, including DL researchers and customers.
Requirements
- PhD in Computer Science, Computer Engineering or related field, or strong equivalent experience; 15+ years of relevant experience in academia or industry.
- Expertise in HPC and parallel programming models (MPI, SHMEM/OpenSHMEM).
- Experience with at least one communication runtime such as MPI, NCCL, NVSHMEM, OpenSHMEM, UCX, UCC.
- Strong knowledge of computer and system architecture, GPU architecture, and CUDA.
- Deep understanding of high-performance networking: network technologies (InfiniBand, Ethernet), network design, topologies, network debugging, and performance analysis.
- Strong skills in areas such as ML/DL fundamentals (and how they tie to communications), parallel algorithms, fault tolerance and resiliency, competitive assessments, and performance analysis and optimization for parallel applications on large clusters.
- Experience developing applications using DL frameworks (PyTorch, TensorFlow).
- Programming fluency in C or C++ for systems software development.
- Flexibility to work and communicate effectively across hardware/software teams and time zones.
Ways to stand out
- Industry-recognized leader in HPC/DL communications with patents, publications, and conference talks or keynotes in relevant areas.
- Influential role in industry standards (e.g. MPI, OpenSHMEM) and open source software (e.g. PyTorch, UCX, Open MPI).
Benefits
- Base salary range (determined by location, experience, and internal pay equity): 308,000 USD - 471,500 USD.
- Eligibility for equity and other NVIDIA benefits (see NVIDIA benefits page).
- NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.
Other details
- Location: Santa Clara, California, United States.
- Employment type: Full time.
- Applications accepted at least until October 22, 2025.