Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by the "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Hiring @ 4
Communication @ 4
Networking @ 4
API @ 4
LLM @ 3
PyTorch @ 3
CUDA @ 6
GPU @ 4
Deep Learning @ 4
AI @ 4
InfiniBand @ 4
vLLM @ 3
NCCL @ 4
TensorRT @ 3
NVLink @ 4
Details
NVIDIA is hiring a Senior GPU Networking Architect to join the networking software group and build and improve GPU communication kernels that link GPU computing with networking. The role focuses on developing GPU-resident communication primitives and device-side APIs, optimizing kernel efficiency and latency, and collaborating with software, hardware, and AI framework teams to co-design communication strategies for large-scale AI systems.
Responsibilities
- Build, implement, and optimize GPU communication kernels that underpin collective and point-to-point operations in large-scale AI systems.
- Leverage deep knowledge of GPU architecture (thread scheduling, memory hierarchy, execution pipelines) to improve kernel efficiency, minimize latency, and overlap computation with communication.
- Develop GPU-resident communication primitives and device-side APIs enabling fine-grained, kernel-initiated data movement across nodes and accelerators.
- Profile and tune GPU kernels end-to-end; identify bottlenecks at the intersection of compute, memory, and network, and drive targeted optimizations.
- Collaborate with network software, hardware, and AI framework teams to co-design communication strategies aligned with GPU execution patterns and emerging model architectures.
- Build proofs-of-concept, conduct experiments, and perform quantitative modeling to evaluate and validate new communication strategies before committing them to production.
- Contribute to the evolution of programming models that expose GPU-aware networking capabilities to application developers.
Requirements
- 5+ years of hands-on CUDA programming, including writing and optimizing non-trivial GPU kernels.
- M.Sc. or equivalent experience in computer science, computer engineering, or a closely related field.
- Strong understanding of GPU architecture fundamentals: warp scheduling, shared memory, L2 cache, memory coalescing, occupancy tuning, and asynchronous execution.
- Experience with systems-level C/C++ development in performance-critical environments.
- Familiarity with GPU data movement mechanisms such as GPUDirect RDMA and GPU-initiated communication.
- Ability to read and reason about GPU performance profiles (e.g., Nsight Compute, Nsight Systems) and translate observations into actionable optimizations.
- Strong collaboration skills in a multi-national, interdisciplinary environment.
Preferred / Ways to stand out
- Experience developing or optimizing communication kernels in libraries such as NCCL, NVSHMEM, or similar GPU-aware communication frameworks.
- Understanding of distributed deep learning parallelism techniques (data, tensor, pipeline, and expert parallelism, including for mixture-of-experts models) and the communication patterns they impose on GPU kernels.
- Background in RDMA, InfiniBand, high-speed networking, and GPU system topology (NVLink, NVSwitch, PCIe) and their impact on communication kernel design.
- Experience with overlap techniques such as kernel pipelining, persistent kernels, or cooperative groups to hide communication latency behind compute.
- Proven experience evaluating and optimizing large-scale LLM training or inference workloads, including hands-on work with frameworks such as PyTorch, TensorRT-LLM, or vLLM, and familiarity with emerging serving architectures such as disaggregated serving.
About compensation and benefits
NVIDIA offers highly competitive salaries and a comprehensive benefits package. Base salary is determined by your location, experience, and the pay of employees in similar positions.
For Poland: The base salary range is 292,500 PLN - 507,000 PLN for Level 4, and 375,000 PLN - 650,000 PLN for Level 5.
More on benefits: www.nvidiabenefits.com/