Distinguished Software Architect - Deep Learning and HPC Communications
at NVIDIA
Santa Clara, United States
USD 308,000-471,500 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Software Development @ 6, Algorithms @ 7, TensorFlow @ 7, Communication @ 4, Networking @ 4, Parallel Programming @ 4, Debugging @ 7, System Architecture @ 7, PyTorch @ 7, CUDA @ 7, GPU @ 4
Details
NVIDIA's GPU Communications Libraries and Networking team delivers communication libraries (NCCL, NVSHMEM, UCX) for Deep Learning and High Performance Computing (HPC). This role will co-design next-generation data center platforms and communication features that scale to tens of thousands of GPUs and high-speed interconnects. You will work across GPU, networking, and software teams to drive performance and adoption of new communication technologies.
Responsibilities
- Research new communication technologies (for example, expand the GPUDirect technology portfolio) and design new features for communication libraries.
- Propose innovative HW and SW solutions for next-generation platforms and co-design with GPU, networking, and software architects to ensure seamless integration.
- Inspire changes backed by quantitative data from proofs of concept or from detailed technical analysis and modeling.
- Drive adoption of new communication technologies across application verticals.
- Keep up with the latest deep learning (DL) research and collaborate with diverse teams (internal and external), including DL researchers and customers.
Requirements
- PhD in Computer Science, Computer Engineering, or a related field, or strong equivalent experience.
- 15+ years of relevant experience in academia or industry.
- Expertise in HPC and parallel programming models (MPI, SHMEM).
- Experience with at least one communication runtime: MPI, NCCL, NVSHMEM, OpenSHMEM, UCX, UCC.
- Strong knowledge of computer and system architecture, GPU architecture, and CUDA.
- Deep understanding of high-performance networking: network technologies (InfiniBand, Ethernet), network design, topologies, debugging, and performance analysis.
- Strong experience in some of these areas: ML/DL fundamentals and their relation to communications, parallel algorithms, fault tolerance and resiliency, competitive assessments, performance analysis and optimizations for parallel applications on large clusters, and developing applications using DL frameworks (PyTorch, TensorFlow).
- Programming fluency in C or C++ for systems software development.
- Flexibility to work and communicate effectively across hardware/software teams and time zones.
Ways to Stand Out
- Industry-recognized leader in HPC/DL communications with a history of patents, publications, conference talks, and keynotes in relevant areas.
- Influential role in industry standards (for example, MPI, OpenSHMEM) and open source projects (for example, PyTorch, UCX, Open MPI).
Compensation & Benefits
- Base salary range: 308,000 USD - 471,500 USD (final base salary determined based on location, experience, and pay of employees in similar positions).
- Eligible for equity and company benefits.
Additional Information
- Applications accepted at least until August 13, 2025.
- NVIDIA is an equal opportunity employer and values diversity in its workforce.