Senior Software Engineer - Parallel Computing Systems

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

  • Algorithms @ 4
  • Distributed Systems @ 4
  • Communication @ 4
  • Parallel Programming @ 7
  • Performance Optimization @ 4
  • PyTorch @ 4
  • CUDA @ 4
  • GPU @ 4

Details

Do you have expertise in CUDA kernel optimization, C++ systems programming, or compiler infrastructure? Join NVIDIA's nvFuser team to build the next-generation fusion compiler that automatically optimizes deep learning models for workloads scaling to thousands of GPUs. This role sits at the intersection of compiler technology and high-performance computing; you will collaborate with the PyTorch Core team and Lightning-AI/Thunder to accelerate PyTorch workloads, and work with hardware architects, framework maintainers, and optimization experts to create compiler infrastructure that advances GPU performance.

Responsibilities

  • Design algorithms that generate highly optimized code from deep learning programs.
  • Build GPU-aware CPU runtime systems that coordinate kernel execution for maximum performance.
  • Debug and remove performance bottlenecks in distributed systems that can scale to thousands of GPUs.
  • Collaborate with NVIDIA hardware engineers and optimization specialists to turn manual optimization techniques into automated compiler optimizations.
  • Influence next-generation hardware design through performance analysis and close collaboration with architecture teams.

Requirements

  • MS or PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related field (or equivalent experience).
  • 4+ years of advanced C++ programming experience with large codebase development, template metaprogramming, and performance-critical code.
  • Strong parallel programming experience (multi-threading, OpenMP, CUDA, MPI, NCCL, NVSHMEM, or other parallel computing technologies).
  • Demonstrated experience with low-level performance optimization and systematic bottleneck identification (beyond basic profiling).
  • Performance analysis skills: experience analyzing high-level programs to identify performance bottlenecks and develop optimization strategies.
  • Collaborative problem-solving approach, adaptability in ambiguous situations, first-principles thinking, and ownership.
  • Excellent verbal and written communication skills.

Ways to stand out

  • Experience with HPC/scientific computing: CUDA optimization, GPU programming, numerical libraries (cuBLAS, NCCL), or distributed computing.
  • Compiler engineering background: LLVM, GCC, domain-specific language design, program analysis, or IR transformations and optimization passes.
  • Deep technical foundation in CPU/GPU architectures, numerical libraries, modular software design, or runtime systems.
  • Experience with large software projects, performance profiling, and a track record of rapid learning.
  • Expertise with distributed parallelism techniques, tensor operations, auto-tuning, or performance modeling.

Compensation & Application

  • Base salary ranges by level:
    • Level 4: 184,000 USD - 287,500 USD
    • Level 5: 224,000 USD - 356,500 USD
  • You will also be eligible for equity and benefits.
  • Applications accepted at least until July 29, 2025.

About NVIDIA

NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. We do not discriminate on the basis of any characteristic protected by law.