Principal Developer, AI Networking

at NVIDIA

📍 Santa Clara, United States

USD 272,000-488,750 per year

SENIOR

✅ On-site

Used Tools & Technologies

Machine Learning

Required Skills & Competences
Tag name is followed by "@" symbol and proficiency level value. About proficiency levels:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.

- Python @ 6
- Distributed Systems @ 4
- TensorFlow @ 7
- Bash @ 6
- Communication @ 4
- Networking @ 4
- Debugging @ 4
- LLM @ 4
- PyTorch @ 4
- CUDA @ 4
- GPU @ 4
- Deep Learning @ 4
- AI @ 4
- Profiling @ 4
- NCCL @ 8
- Performance Analysis @ 4

Details

NVIDIA is seeking a senior software engineer to join the AI Networking Codesign and Benchmarking R&D group to profile, analyze, and optimize AI workloads on large-scale GPU and CPU clusters used for distributed deep learning (LLM) training and inference. The role focuses on collective communication and high-performance networking across hardware (HCAs, switches, CPUs, GPUs, systems) and software (LLM applications, ML frameworks, communication and compute libraries). You will build performance analysis tools and strategies to investigate performance expectations, limitations, and bottlenecks.

Responsibilities

Characterize AI workloads and deep learning models for large-scale LLM training and inference on NVIDIA supercomputers, with emphasis on distributed systems and high-performance networking.
Benchmark, profile, and analyze performance to find bottlenecks and identify opportunities for improvement and optimization, emphasizing networking.
Develop PyTorch trace-based profiling, analysis, and replay toolsets to aid benchmarking, debugging, and co-design of network systems for LLM workloads.
Collaborate across hardware and software teams to provide performance analysis insights.
Define performance test plans, set performance expectations for new technologies/solutions, and work to achieve performance targets.

Requirements

B.Sc. in Computer Science, Software Engineering, or equivalent experience.
15+ years of experience with high-performance networking (RDMA, MPI, NCCL, SHARP).
Demonstrated ability in performance evaluation techniques and approaches.
Experience with NVIDIA GPUs and the CUDA library.
Knowledge of deep learning frameworks such as TensorFlow or PyTorch, in particular experience with PyTorch profiling and trace-based toolsets.
Expertise in collective communication libraries (NCCL) and networking technologies such as RDMA and RoCE.
Proficiency in programming languages: Python, Bash, and C++.
Experience with container-based development environments.
Strong analytical and problem-solving skills; able to learn quickly and work autonomously and collaboratively.

Preferred / Ways to Stand Out

Extensive hands-on experience with AI workloads and benchmarking for distributed LLM training.
Deep knowledge of PyTorch, CUDA, and NCCL libraries.
Comprehensive system knowledge (Intel/AMD/ARM CPUs, NVIDIA GPUs, HCA, memory, PCI).
Strong capabilities in performance evaluation using contemporary tools and methods.

Compensation & Benefits

Base salary ranges (location, experience, and level dependent):
- Level 6: 272,000 USD - 431,250 USD
- Level 7: 320,000 USD - 488,750 USD
Eligible for equity and a generous benefits package.

Applications for this job will be accepted at least until June 16, 2026. NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.