Senior Software Engineer, CUTLASS Performance

at NVIDIA

📍 Santa Clara, United States

USD 152,000-287,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

HPC

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.

Python @ 4, Performance Optimization @ 4, QA @ 4, LLM @ 4, PyTorch @ 4, GPU @ 4, Deep Learning @ 4, AI @ 4, vLLM @ 4, SGLang @ 4, Performance Analysis @ 4, JAX @ 4

Details

NVIDIA's high-performance computing platforms power AI across many applications. CUTLASS is an open-source ecosystem for high-performance linear algebra and Tensor Core primitives, providing C++ and Python abstractions for custom matrix multiply (GEMM) and related deep learning computations on NVIDIA GPUs.

Responsibilities

Benchmark the inference and training passes of state-of-the-art deep learning models to identify key GPU kernel and fusion opportunities.
Identify gaps between theoretical and realized performance, and suggest software improvements or model adjustments to resolve them.
Develop tooling to automate the benchmarking, analysis, and performance optimization loop to push the limit of CUTLASS kernel performance within DL networks.
Serve as the authoritative resource on kernel performance for the team and engage with GPU architecture, deep learning framework, and QA teams across NVIDIA as the CUTLASS performance representative.

Requirements

Master's or PhD in Computer Science, Computer Engineering, or a related field (or equivalent experience).
3+ years of relevant industry experience.
Strong programming skills in Python and C++.
Experience in software performance analysis and optimization.
Deep understanding of computer architecture and familiarity with GPUs or similar parallel processing architectures.

Preferred / Ways to stand out

Deep understanding of state-of-the-art deep learning model architectures.
Hands-on experience with performance benchmarking of DL frameworks such as PyTorch, JAX, SGLang, vLLM, TRT-LLM, or others.
Experience developing performance models and performance regression systems.

Compensation & Benefits

Base salary ranges (determined by location, experience, and peer pay):
- Level 3: 152,000 USD - 241,500 USD per year
- Level 4: 184,000 USD - 287,500 USD per year
Eligible for equity and additional benefits (link to NVIDIA benefits referenced in posting).

Additional information

Applications are accepted at least until June 5, 2026. This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer and is committed to diversity and inclusion.