Senior GenAI Algorithms Engineer — Model Optimizations for Inference
Required Skills & Competences
Python (6), Algorithms (4), Hiring (4), Communication (7), Mathematics (4), Debugging (7), API (4), LLM (4), PyTorch (6), CUDA (4), GPU (4)
NVIDIA is at the forefront of the generative AI revolution. The Algorithmic Model Optimization Team focuses on optimizing generative AI models (large language models, visual-language models, multimodal, and diffusion models) for maximal inference efficiency using techniques such as quantization, speculative decoding, sparsity, distillation, pruning, neural architecture search, and streamlined deployment strategies with open-source inference frameworks. This role is responsible for designing, implementing, and productionizing model optimization algorithms for inference and deployment on NVIDIA hardware, with a focus on ease of use, compute and memory efficiency, and accuracy–performance tradeoffs through software–hardware co-design.
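As a concrete illustration of the first technique on that list, the sketch below shows post-training int8 weight quantization (per-channel, symmetric) in plain PyTorch. It is a minimal, illustrative example and assumes nothing about NVIDIA's internal tooling; production flows such as TensorRT Model Optimizer additionally handle activation calibration, per-layer configuration, and deployment export.

```python
# Minimal sketch of post-training int8 weight quantization (per-channel,
# symmetric). Illustrative only; layer sizes and the error metric are
# arbitrary stand-ins, not part of any specific production workflow.
import torch
import torch.nn as nn

def quantize_weight_int8(w: torch.Tensor):
    """Quantize a [out_features, in_features] weight per output channel."""
    # One scale per output channel, chosen so the max |w| maps to 127.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Example: quantize one Linear layer and measure the error it introduces.
layer = nn.Linear(4096, 4096, bias=False)
q, scale = quantize_weight_int8(layer.weight.data)
w_hat = dequantize(q, scale)

x = torch.randn(8, 4096)
err = (x @ layer.weight.T - x @ w_hat.T).abs().mean()
print(f"mean absolute output error after int8 round-trip: {err.item():.6f}")
```

Per-channel scaling is the common default for weight-only quantization because it bounds the rounding error of each output channel independently, which helps preserve accuracy at low bit widths.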
Responsibilities
- Design and build modular, scalable model optimization software platforms that deliver strong user experiences while supporting diverse AI models and optimization techniques.
- Explore, develop, and integrate deep learning optimization algorithms (e.g., quantization, speculative decoding, sparsity) into NVIDIA's AI software stack, including TensorRT Model Optimizer, NeMo/Megatron, and TensorRT-LLM (a toy speculative-decoding sketch follows this list).
- Deploy optimized models into leading open-source inference frameworks and contribute specialized APIs, model-level optimizations, and new features targeted to NVIDIA hardware capabilities.
- Partner with internal NVIDIA teams to deliver model optimization solutions for customer use cases, ensuring optimal end-to-end workflows and balanced accuracy-performance trade-offs.
- Conduct deep GPU kernel-level profiling to identify hardware and software optimization opportunities (e.g., efficient attention kernels, KV cache optimization, parallelism strategies); a minimal torch.profiler sketch also follows this list.
- Drive continuous innovation in deep learning inference performance to strengthen NVIDIA platform integration and expand market adoption across the AI inference ecosystem.
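The speculative-decoding sketch referenced above uses greedy verification with two toy models. The model definitions, vocabulary size, and draft length K are placeholders chosen for readability, not anything prescribed by the role or by NVIDIA's stack; the point is the control flow: cheap drafting, one verifying forward pass, accept the longest matching prefix.

```python
# Hedged sketch of speculative decoding with greedy verification: a small
# "draft" model proposes K tokens autoregressively, the larger "target"
# model scores all of them in a single forward pass, and the longest prefix
# whose tokens match the target's own greedy choices is accepted.
import torch
import torch.nn as nn

VOCAB, K = 100, 4

class ToyLM(nn.Module):
    """Tiny causal LM (embedding + linear head), enough to show the control flow."""
    def __init__(self, dim):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, ids):                     # ids: [T] -> logits: [T, VOCAB]
        return self.head(self.embed(ids))

draft, target = ToyLM(16), ToyLM(64)

@torch.no_grad()
def speculative_step(prompt: torch.Tensor) -> torch.Tensor:
    # 1) Draft model proposes K tokens, one at a time (cheap).
    ids = prompt.clone()
    for _ in range(K):
        nxt = draft(ids)[-1].argmax()
        ids = torch.cat([ids, nxt.view(1)])

    # 2) Target model verifies all K proposals in ONE forward pass, so its
    #    cost is amortized over K positions instead of K sequential calls.
    logits = target(ids)
    verify = logits[len(prompt) - 1 : -1].argmax(dim=-1)   # target's greedy picks
    drafted = ids[len(prompt):]                            # draft's proposals

    # 3) Accept the longest matching prefix, then take one token from the target.
    n_accept = int((verify == drafted).int().cumprod(dim=0).sum())
    accepted = drafted[:n_accept]
    correction = verify[n_accept].view(1) if n_accept < K else logits[-1].argmax().view(1)
    return torch.cat([prompt, accepted, correction])

out = speculative_step(torch.randint(0, VOCAB, (8,)))
print("sequence length after one speculative step:", out.shape[0])
```

With random toy weights the acceptance rate is near zero; in practice the draft model is trained or distilled to agree with the target, which is what makes the single verification pass pay off.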
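For the kernel-level profiling item, the sketch below shows a first-pass profile with torch.profiler of the kind used to locate attention or GEMM hotspots before digging into Nsight or kernel code. The model and input shapes are arbitrary stand-ins; it falls back to CPU-only profiling when no GPU is available.

```python
# Minimal kernel-level profiling sketch with torch.profiler: run a few
# forward passes and rank ops by time spent on the device.
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).to(device)
x = torch.randn(8, 256, 512, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    with torch.no_grad():
        for _ in range(10):
            model(x)

# Sort by device time to see which kernels dominate the step.
sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```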
Requirements
- Master's, PhD, or equivalent experience in Computer Science, Artificial Intelligence, Applied Mathematics, or a related field.
- 5+ years of relevant work or research experience in deep learning.
- Strong software design skills, including debugging, performance analysis, and test development.
- Proficiency in Python, PyTorch, and modern ML frameworks/tools.
- Proven foundation in algorithms and programming fundamentals.
- Strong written and verbal communication skills, with the ability to work independently and collaboratively in a fast-paced environment.
Ways to stand out
- Contributions to PyTorch, JAX, vLLM, SGLang, or other ML training and inference frameworks.
- Hands-on experience training or fine-tuning generative AI models on large-scale GPU clusters.
- Proficiency with GPU architectures and compilation stacks, and skill in analyzing and debugging end-to-end performance.
- Familiarity with NVIDIA's deep learning SDKs (e.g., TensorRT).
- Experience developing high-performance GPU kernels for ML workloads using CUDA, CUTLASS, or Triton.
Compensation & Benefits
- Base salary ranges (location and level dependent):
  - Level 3: 148,000 USD - 235,750 USD
  - Level 4: 184,000 USD - 287,500 USD
- Eligible for equity and benefits (see NVIDIA benefits page).
Additional Information
- Location: Santa Clara, CA, United States.
- Time type: Full time.
- Applications for this job will be accepted at least until September 26, 2025.
- NVIDIA is an equal opportunity employer and values diversity in hiring and promotion practices.