AI Inference Performance Engineer

at NVIDIA
USD 152,000-241,500 per year
MIDDLE
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

Software Development @ 5, Kubernetes @ 3, Python @ 6, Algorithms @ 3, Leadership @ 3, Debugging @ 3, Technical Leadership @ 3, LLM @ 3, PyTorch @ 3, CUDA @ 3, GPU @ 3, Deep Learning @ 3, AI @ 3, Profiling @ 3, vLLM @ 3, GenAI @ 3, NCCL @ 3, TensorRT @ 3, SGLang @ 3, HPC @ 3

Details

We optimize and benchmark GenAI inference on NVIDIA's latest accelerators, defining industry performance standards across language models, video generation, and speech workloads. The team works directly with TensorRT-LLM, SGLang, and vLLM, building tools to evaluate serving performance at scale. This role sits at the intersection of GPU performance engineering and public accountability.

Responsibilities

  • Drive industry benchmark results: own the end-to-end optimization pipeline and implement/integrate optimizations in quantization, scheduling, memory management, and distributed inference across TensorRT-LLM, SGLang, and vLLM.
  • Define and optimize cutting-edge workloads: identify and shape next-generation inference benchmarks, such as multi-turn coding, agentic workflows, and other emerging AI use cases. Collaborate with framework and kernel teams to push performance on large-scale LLM-MoE models, vision-language models, video diffusion models, recommendation workloads, and speech workloads.
  • Architect distributed inference: design and optimize execution from single-GPU to rack-scale deployments, managing performance across GPU clusters.
  • Establish performance methodology: apply roofline analysis and systematic profiling to decompose bottlenecks across CUDA kernels, frameworks, and serving layers.
  • Influence the ecosystem: contribute to TensorRT-LLM, vLLM, SGLang, and other open-source projects. Partner with architecture, kernel, and compiler teams to shape GPU roadmaps based on real workload data.
  • Technical leadership: raise the technical bar for the team, drive cross-functional execution on tight benchmark timelines, and lead a world-class team.

Requirements

  • BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.
  • 5+ years of relevant software development experience.
  • Strong Python or C++ programming, software design, and software engineering skills.
  • Expertise with a deep-learning framework such as PyTorch or JAX.
  • Proven track record of delivering measurable performance improvements in deep learning inference or high-performance systems.
  • Deep understanding of LLM/VLM architectures and inference mechanics: attention, KV caching, batching strategies, decode-phase bottlenecks, speculative decoding, disaggregated serving, etc.

Ways To Stand Out

  • Prior experience with an LLM framework (TensorRT-LLM, vLLM, SGLang, etc.) or a DL compiler in inference, deployment, algorithms, or implementation.
  • Prior experience with performance modeling, profiling, debugging, and code optimization of DL/HPC/high-performance applications.
  • Experience with scale-out inference orchestration (MPI, NCCL, Kubernetes) on large GPU clusters.
  • Expertise in kernel development (CUTLASS, CuTe DSL, TileLang, OpenAI Triton) or compiler/runtime paths (torch.compile, graph lowering, operator fusion). Architectural knowledge of CPUs, GPUs, FPGAs, or other DL accelerators; GPU programming experience (CUDA).
  • Track record of leading ambiguous, high-impact technical programs across multiple teams under tight deadlines.

Benefits

  • Eligible for equity and company benefits (link to benefits page provided in the original posting).

Additional information

  • Office policy: Hybrid (#LI-Hybrid).
  • Location: Santa Clara, California, United States.
  • Employment type: Full time.
  • Application deadline: applications are accepted at least until March 13, 2026.
  • Salary: base salary range is 152,000 USD - 241,500 USD.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer.