AI Inference Performance Engineer - New College Grad 2026

at NVIDIA
USD 124,000-241,500 per year
MIDDLE
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Software Development @ 3, Kubernetes @ 3, Python @ 6, Algorithms @ 3, Leadership @ 3, Technical Leadership @ 3, LLM @ 3, PyTorch @ 3, CUDA @ 3, GPU @ 3, Deep Learning @ 3, AI @ 3, Profiling @ 3, vLLM @ 3, GenAI @ 3, NCCL @ 3, TensorRT @ 3, SGLang @ 3, HPC @ 3

Details

We optimize and benchmark GenAI inference on NVIDIA's latest accelerators, defining industry performance standards across language models, video generation, and speech workloads. The team works directly with TensorRT-LLM, SGLang, and vLLM, building tools that evaluate serving performance at scale. This team sits at the intersection of GPU performance engineering and public accountability.

Responsibilities

  • Own the end-to-end optimization pipeline and drive industry benchmark results: implement and integrate optimizations in quantization, scheduling, memory management, and distributed inference across TensorRT-LLM, SGLang, and vLLM.
  • Define and optimize cutting-edge workloads: identify and shape next-generation inference benchmarks (multi-turn coding, agentic workflows, etc.) and collaborate with framework and kernel teams to push performance on LLM-MoE models, vision-language models, video diffusion models, recommendation, and speech workloads.
  • Architect distributed inference: design and optimize execution from a single GPU up to rack-scale systems, and manage performance across multi-GPU clusters.
  • Establish performance methodology: apply roofline analysis and systematic profiling to decompose bottlenecks across CUDA kernels, frameworks, and serving layers.
  • Influence the ecosystem: contribute to TensorRT-LLM, vLLM, SGLang, and other open-source projects; partner with architecture, kernel, and compiler teams to shape GPU roadmaps based on real workload data.
  • Provide technical leadership: raise the technical bar, drive cross-functional execution on tight benchmark timelines, and lead high-impact projects.

Requirements

  • BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.
  • 2+ years of relevant software development experience.
  • Strong Python or C++ programming, software design, and software engineering skills.
  • Expertise with a deep learning framework such as PyTorch or JAX.
  • Proven track record of delivering measurable performance improvements in deep learning inference or high-performance systems.
  • Deep understanding of LLM/VLM architectures and inference mechanics: attention, KV caching, batching strategies, decode-phase bottlenecks, speculative decoding, disaggregated serving, etc.

Ways To Stand Out From The Crowd

  • Prior experience working on an LLM framework (TensorRT-LLM, vLLM, SGLang) or a DL compiler, in areas such as inference, deployment, algorithms, or implementation.
  • Prior experience with performance modeling, profiling, debugging, and code optimization of DL, HPC, or other high-performance applications.
  • Experience with scale-out inference orchestration (MPI, NCCL, Kubernetes) on large GPU clusters.
  • Expertise in kernel development (CUTLASS, cuteDSL, tilelang, OpenAI Triton) or compiler/runtime paths (torch.compile, graph lowering, operator fusion). Architectural knowledge of CPUs, GPUs, FPGAs, or other DL accelerators; GPU programming experience (CUDA).
  • Track record of leading ambiguous, high-impact technical programs across multiple teams under tight deadlines.

Compensation and Benefits

  • Base salary ranges: USD 124,000-195,500 for Level 2; USD 152,000-241,500 for Level 3.
  • Eligible for equity and company benefits.

Other Details

  • Applications accepted at least until March 9, 2026.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer and states a commitment to diversity and non-discrimination.