Used Tools & Technologies
HPC
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Python @ 4
Performance Optimization @ 4
QA @ 4
LLM @ 4
PyTorch @ 4
GPU @ 4
Deep Learning @ 4
AI @ 4
vLLM @ 4
SGLang @ 4
Performance Analysis @ 4
JAX @ 4
Details
NVIDIA's high-performance computing platforms power AI across many applications. CUTLASS is an open-source ecosystem for high-performance linear algebra and Tensor Core primitives, providing C++ and Python abstractions for custom matrix multiply (GEMM) and related deep learning computations on NVIDIA GPUs.
Responsibilities
- Benchmark the performance of state-of-the-art deep learning models' inference and training passes to identify key GPU kernel and fusion opportunities.
- Identify gaps between theoretical and realized performance, and suggest software improvements or model adjustments to resolve them.
- Develop tooling to automate the benchmarking, analysis, and performance optimization loop to push the limit of CUTLASS kernel performance within DL networks.
- Serve as the authoritative resource on kernel performance for the team and engage with GPU architecture, deep learning framework, and QA teams across NVIDIA as the CUTLASS performance representative.
Requirements
- Master's or PhD in Computer Science, Computer Engineering, or a related field (or equivalent experience).
- 3+ years of relevant industry experience.
- Strong programming skills in Python and C++.
- Experience in software performance analysis and optimization.
- Deep understanding of computer architecture and familiarity with GPUs or similar parallel processing architectures.
Preferred / Ways to stand out
- Deep understanding of state-of-the-art deep learning model architectures.
- Hands-on experience with performance benchmarking of DL frameworks such as PyTorch, JAX, SGLang, vLLM, TRT-LLM, or others.
- Experience developing performance models and performance regression systems.
Compensation & Benefits
- Base salary ranges (determined by location, experience, and peer pay):
  - Level 3: 152,000 USD - 241,500 USD per year
  - Level 4: 184,000 USD - 287,500 USD per year
- Eligible for equity and additional benefits (the original posting links to NVIDIA's benefits page).
Additional information
- Applications accepted at least until June 5, 2026. This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer and is committed to diversity and inclusion.