Senior Inference Engineer, AIConfigurator for Dynamo

at NVIDIA

📍 Santa Clara, United States

USD 184,000-356,500 per year

SENIOR

✅ Hybrid

Used Tools & Technologies

Machine Learning

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and you know all the pitfalls and tricks.

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Kubernetes @ 4, Python @ 4, GitHub @ 4, Distributed Systems @ 4, Communication @ 7, Rust @ 4, Debugging @ 7, API @ 4, LLM @ 4, GPU @ 4, AI @ 4, Profiling @ 4, Agentic AI @ 4, vLLM @ 4, NCCL @ 3, TensorRT @ 4, SGLang @ 4, Performance Analysis @ 4

Details

NVIDIA is recruiting a Senior Inference Engineer to advance AIConfigurator (https://github.com/ai-dynamo/aiconfigurator), a system that automatically discovers high-performance deployment configurations for large-scale LLM inference. This role integrates GPU systems, model serving, performance modeling, and production software engineering to help users deploy models on NVIDIA platforms by optimizing efficiency, latency, parallelism, and resource utilization across aggregated and disaggregated serving architectures. The team partners with Dynamo, TensorRT-LLM, vLLM, SGLang, benchmarking, and platform teams to translate performance data into deployment guidance. This is a high-impact individual contributor role focused on owning deep technical systems and making them practical for developers and customers.

Responsibilities

Build and evolve AIConfigurator's core optimization engine for LLM serving, including configuration search, SLA-aware ranking, efficiency and latency estimation, and Pareto frontier analysis.
Build production-quality Python/Rust APIs, CLIs, SDK surfaces, and web workflows to help users generate deployment configurations for NVIDIA GPU clusters.
Develop configuration generation systems that emit backend-specific artifacts for Dynamo, Kubernetes, TensorRT-LLM, vLLM, and SGLang deployments.
Collaborate with inference runtime, performance, benchmarking, and product groups to ensure simulated results correspond with actual deployment performance on H100, H200, B200, GB200, and upcoming NVIDIA platforms.
Improve model, hardware, and backend support by integrating performance databases, profiling data, support matrices, and validation tools.
Drive software quality through maintainable architecture, schema development, tests, documentation, and automation suitable for both open-source and production users.
Convert inference concepts (prefill/decode disaggregation, tensor parallelism, pipeline parallelism, expert parallelism, batching, KV cache behavior) into dependable software abstractions.

Requirements

BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, Applied Math, or a related field, or equivalent experience.
10+ years of relevant software engineering experience.
Strong Python and Rust engineering skills, including production APIs, CLI tools, packaging, testing, and debugging.
Experience with GPU computing, distributed systems, ML infrastructure, or high-performance model serving.
Understanding of LLM inference concepts: batching, latency, efficiency, memory constraints, parallelism strategies, and serving SLAs.
Experience with data-driven performance analysis, benchmarking, simulation, optimization, or managing resource needs.
Ability to collaborate across research, runtime, platform, and customer-facing engineering teams.
Strong written and verbal communication skills to explain sophisticated technical tradeoffs clearly.

Ways to stand out

Practical experience with TensorRT-LLM, vLLM, SGLang, Triton Inference Server, Dynamo, or Kubernetes.
Experience improving LLM deployments on NVIDIA GPUs, especially H100, H200, B200, GB200, or multi-node GPU clusters.
Familiarity with disaggregated serving, prefill/decode separation, KV cache management, NCCL/NIXL/NVSHMEM communication, or expert-parallel MoE inference.
Open-source project experience, technical writing, or prior ownership of developer-facing tools.
Experience applying agentic AI solutions to solve complex technical problems.

Compensation & Benefits

Base salary ranges: Level 4: 184,000 USD - 287,500 USD; Level 5: 224,000 USD - 356,500 USD.
Eligible for equity and additional benefits (see NVIDIA benefits page).

Other

Location: Santa Clara, CA, United States.
Applications accepted at least until June 16, 2026.
NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.