Senior Inference Engineer, AIConfigurator for Dynamo
Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9: you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Kubernetes @ 4
Python @ 4
GitHub @ 4
Distributed Systems @ 4
Communication @ 7
Rust @ 4
Debugging @ 7
API @ 4
LLM @ 4
GPU @ 4
AI @ 4
Profiling @ 4
Agentic AI @ 4
vLLM @ 4
NCCL @ 3
TensorRT @ 4
SGLang @ 4
Performance Analysis @ 4
Details
NVIDIA is recruiting a Senior Inference Engineer to advance AIConfigurator (https://github.com/ai-dynamo/aiconfigurator), a system that automatically discovers high-performance deployment configurations for large-scale LLM inference. This role integrates GPU systems, model serving, performance modeling, and production software engineering to help users deploy models on NVIDIA platforms by optimizing efficiency, latency, parallelism, and resource utilization across aggregated and disaggregated serving architectures. The team partners with Dynamo, TensorRT-LLM, vLLM, SGLang, benchmarking, and platform teams to translate performance data into deployment guidance. This is a high-impact individual contributor role focused on owning deep technical systems and making them practical for developers and customers.
Responsibilities
- Build and evolve AIConfigurator's core optimization engine for LLM serving, including configuration search, SLA-aware ranking, efficiency and latency estimation, and Pareto frontier analysis.
- Build production-quality Python/Rust APIs, CLIs, SDK surfaces, and web workflows to help users generate deployment configurations for NVIDIA GPU clusters.
- Develop configuration generation systems that emit backend-specific artifacts for Dynamo, Kubernetes, TensorRT-LLM, vLLM, and SGLang deployments.
- Collaborate with inference runtime, performance, benchmarking, and product groups to ensure simulated results correspond with actual deployment performance on H100, H200, B200, GB200, and upcoming NVIDIA platforms.
- Improve model, hardware, and backend support by integrating performance databases, profiling data, support matrices, and validation tools.
- Drive software quality through maintainable architecture, schema development, tests, documentation, and automation suitable for both open-source and production users.
- Convert inference concepts (prefill/decode disaggregation, tensor parallelism, pipeline parallelism, expert parallelism, batching, KV cache behavior) into dependable software abstractions.
Requirements
- BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, Applied Math, or a related field, or equivalent experience.
- 10+ years of relevant software engineering experience.
- Strong Python and Rust engineering skills, including production APIs, CLI tools, packaging, testing, and debugging.
- Experience with GPU computing, distributed systems, ML infrastructure, or high-performance model serving.
- Understanding of LLM inference concepts: batching, latency, efficiency, memory constraints, parallelism strategies, and serving SLAs.
- Experience with data-driven performance analysis, benchmarking, simulation, optimization, or managing resource needs.
- Ability to collaborate across research, runtime, platform, and customer-facing engineering teams.
- Strong written and verbal communication skills to explain sophisticated technical tradeoffs clearly.
Ways to stand out
- Practical experience with TensorRT-LLM, vLLM, SGLang, Triton Inference Server, Dynamo, or Kubernetes.
- Experience improving LLM deployments on NVIDIA GPUs, especially H100, H200, B200, GB200, or multi-node GPU clusters.
- Familiarity with disaggregated serving, prefill/decode separation, KV cache management, NCCL/NIXL/NVSHMEM communication, or expert-parallel MoE inference.
- Open-source project experience, technical writing, or prior ownership of developer-facing tools.
- Experience applying agentic AI solutions to solve complex technical problems.
Compensation & Benefits
- Base salary ranges: Level 4: 184,000 USD - 287,500 USD; Level 5: 224,000 USD - 356,500 USD.
- Eligible for equity and additional benefits (see NVIDIA benefits page).
Other
- Location: Santa Clara, CA, United States. #LI-Hybrid
- Applications accepted at least until June 16, 2026.
- NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.