Senior Software Engineer - Parallel Computing Systems

at Nvidia
USD 148,000-235,750 per year
SENIOR
✅ On-site

Required Skills & Competences

Algorithms (4), Distributed Systems (4), Hiring (4), Communication (4), Parallel Programming (7), Performance Optimization (4), PyTorch (4), CUDA (4), GPU (4)

Details

Do you have expertise in CUDA kernel optimization, C++ systems programming, or compiler infrastructure? Join NVIDIA's nvFuser team to build a next-generation fusion compiler that automatically optimizes deep learning models for workloads scaling to thousands of GPUs. The Deep Learning Frameworks Team at NVIDIA builds nvFuser, an advanced fusion compiler at the intersection of compiler technology and high-performance computing. You'll work closely with the PyTorch Core team and collaborate with Lightning-AI/Thunder to accelerate PyTorch workloads. The role involves collaborating with hardware architects, framework maintainers, and optimization experts to create compiler infrastructure that advances GPU performance and automates optimizations that previously had to be done by hand.
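
For context on what "fusion" means here, the snippet below is a minimal, illustrative sketch (not part of the posting) of the kind of pointwise-operation chain a fusion compiler such as nvFuser can collapse into a single GPU kernel. The use of torch.compile and the assumption that nvFuser is the active backend are illustrative only; which backend actually performs the fusion depends on the PyTorch build and configuration.

    import torch

    # A chain of pointwise ops (bias add + GELU + scale). A fusion compiler
    # can merge these into one GPU kernel, avoiding round trips to global
    # memory for the intermediate tensors.
    def pointwise_chain(x, bias):
        return torch.nn.functional.gelu(x + bias) * 2.0

    # torch.compile traces the function and hands it to a compiler backend.
    # Whether nvFuser performs the fusion is an assumption here, not a given.
    compiled = torch.compile(pointwise_chain)

    x = torch.randn(1024, 1024, device="cuda")
    bias = torch.randn(1024, device="cuda")
    out = compiled(x, bias)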

Responsibilities

  • Design algorithms that generate highly optimized code from deep learning programs.
  • Build GPU-aware CPU runtime systems that coordinate kernel execution for maximum performance.
  • Debug performance bottlenecks in thousand-GPU distributed systems and develop systematic optimization strategies.
  • Work directly with NVIDIA hardware engineers and optimization specialists to influence next-generation hardware and compiler optimizations.

Requirements

  • MS or PhD in Computer Science, Computer Engineering, Electrical Engineering, or related field (or equivalent experience).
  • 2+ years of advanced C++ programming experience, including large-codebase development, template metaprogramming, and performance-critical code.
  • Strong parallel programming experience with multi-threading and technologies such as OpenMP, CUDA, MPI, NCCL, or NVSHMEM.
  • Demonstrated experience in low-level performance optimization and systematic bottleneck identification (beyond basic profiling).
  • Performance analysis skills: analyzing high-level programs to identify bottlenecks and develop optimization strategies.
  • Collaborative problem-solving approach, adaptability in ambiguous situations, first-principles thinking, and strong ownership.
  • Excellent verbal and written communication skills.

Ways to stand out

  • Experience with HPC/scientific computing: CUDA optimization, GPU programming, numerical libraries (cuBLAS, NCCL), or distributed computing.
  • Compiler engineering background: LLVM, GCC, domain-specific language design, program analysis, or IR transformations and optimization passes.
  • Deep technical foundation in CPU/GPU architectures, numeric libraries, modular software design, or runtime systems.
  • Experience with large software projects, performance profiling, and a demonstrated track record of rapid learning.
  • Expertise with distributed parallelism techniques, tensor operations, auto-tuning, or performance modeling.

Compensation & Benefits

  • Base salary range: 148,000 USD - 235,750 USD (determined by location, experience, and comparable pay).
  • Eligible for equity and benefits.

Additional Information

  • Applications for this job will be accepted at least until August 29, 2025.
  • NVIDIA is an equal opportunity employer and values diversity in hiring and promotion practices.