AI Inference Performance Engineer

at NVIDIA

📍 Santa Clara, United States

USD 152,000-241,500 per year

MIDDLE

✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. Proficiency levels:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and know all the pitfalls and tricks.

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Software Development @ 5, Kubernetes @ 3, Python @ 6, Algorithms @ 3, Leadership @ 3, Debugging @ 3, Technical Leadership @ 3, LLM @ 3, PyTorch @ 3, CUDA @ 3, GPU @ 3, Deep Learning @ 3, AI @ 3, Profiling @ 3, vLLM @ 3, GenAI @ 3, NCCL @ 3, TensorRT @ 3, SGLang @ 3, HPC @ 3

Details

We optimize and benchmark GenAI inference on NVIDIA's latest accelerators, defining industry performance standards across language models, video generation, and speech workloads. The team works directly with TensorRT-LLM, SGLang, and vLLM, building tools to evaluate serving performance at scale. This role sits at the intersection of GPU performance engineering and public accountability.

Responsibilities

Drive industry benchmark results: own the end-to-end optimization pipeline and implement/integrate optimizations in quantization, scheduling, memory management, and distributed inference across TensorRT-LLM, SGLang, and vLLM.
Define and optimize cutting-edge workloads: identify and shape next-generation inference benchmarks for multi-turn coding, agentic workflows, and other emerging AI use cases. Collaborate with framework and kernel teams to push performance on large-scale LLM-MoE models, vision-language models, video diffusion models, recommendation, and speech workloads.
Architect distributed inference: design and optimize execution from a single GPU to rack-scale clusters, managing performance at every scale in between.
Establish performance methodology: apply roofline analysis and systematic profiling to decompose bottlenecks across CUDA kernels, frameworks, and serving layers.
Influence the ecosystem: contribute to TensorRT-LLM, vLLM, SGLang, and other open-source projects. Partner with architecture, kernel, and compiler teams to shape GPU roadmaps based on real workload data.
Technical leadership: lead a world-class team, raise its technical bar, and drive cross-functional execution on tight benchmark timelines.

Requirements

BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.
5+ years of relevant software development experience.
Strong Python or C++ programming, software design, and software engineering skills.
Expertise with a deep-learning framework such as PyTorch or JAX.
Proven track record of delivering measurable performance improvements in deep learning inference or high-performance systems.
Deep understanding of LLM/VLM architectures and inference mechanics: attention, KV caching, batching strategies, decode-phase bottlenecks, speculative decoding, disaggregated serving, etc.

Ways To Stand Out

Prior experience with an LLM framework (TensorRT-LLM, vLLM, SGLang, etc.) or a DL compiler, in areas such as inference, deployment, algorithms, or implementation.
Prior experience with performance modeling, profiling, debugging, and code optimization of DL/HPC/high-performance applications.
Experience with scale-out inference orchestration (MPI, NCCL, Kubernetes) on large GPU clusters.
Expertise in kernel development (CUTLASS, cuteDSL, tilelang, OpenAI Triton) or compiler/runtime paths (torch.compile, graph lowering, operator fusion). Architectural knowledge of CPUs, GPUs, FPGAs, or other DL accelerators; GPU programming experience (CUDA).
Track record of leading ambiguous, high-impact technical programs across multiple teams under tight deadlines.

Benefits

Eligible for equity and company benefits (link to benefits page provided in the original posting).

Additional information

Office policy: Hybrid (#LI-Hybrid).
Location: Santa Clara, California, United States.
Employment type: Full time.
Application deadline: applications are accepted at least until March 13, 2026.
Salary: base salary range is USD 152,000-241,500 per year.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer.