Senior Applied AI Software Engineer, Distributed Inference Systems
at Nvidia
Santa Clara, United States
USD 148,000-287,500 per year
Required Skills & Competences
Docker (4), Go (4), Kubernetes (4), Python (4), GitHub (4), CI/CD (3), Algorithms (4), Distributed Systems (7), Communication (4), Networking (4), Rust (4), API (4), LLM (4), GPU (4)
Details
NVIDIA Dynamo is an open-source platform focused on efficient, scalable inference for large language and reasoning models in distributed GPU environments. The team builds serving architecture, GPU resource management, and intelligent request handling to deliver high-performance AI inference at scale. This role focuses on advancing the Dynamo project to support production-grade distributed inference across thousands of GPUs and multiple inference engines.
Responsibilities
- Build the Kubernetes deployment and workload management stack for Dynamo to support inference deployments at scale; identify bottlenecks and apply optimizations to make full use of hardware capacity (a minimal GPU-aware Deployment sketch follows this list).
- Design, implement, and optimize distributed inference components in Rust and Python.
- Introduce new features to the Dynamo Python SDK and the Dynamo Rust runtime core library.
- Contribute to disaggregated serving: separate the prefill (context ingestion) and decode (token generation) phases across distinct GPU clusters, and extend disaggregation to multi-modal models (vision-language, audio-language, video-language); a prefill/decode sketch follows this list.
- Develop and refine planner algorithms for dynamic GPU scheduling, allocation, and rebalancing based on fluctuating workloads and system bottlenecks (see the rebalancing sketch below).
- Improve intelligent routing so that inference requests land on GPU worker replicas that already hold relevant KV cache data, minimizing re-computation and latency for multi-step reasoning tasks (see the routing sketch below).
- Innovate in distributed KV cache management and transfers across heterogeneous memory and storage hierarchies using the NVIDIA Inference Xfer Library (NIXL) for low-latency, cost-effective data movement (see the tiered-cache sketch below).
- Contribute to open-source repositories, participate in code reviews, assist with issue triage on GitHub, and write clear documentation and developer/user guides.
- Work closely with the community to capture feedback and evolve APIs and architecture.
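
As context for the Kubernetes bullet above, here is a minimal sketch using the official Kubernetes Python client: it declares a pool of worker replicas and requests one GPU per pod through the nvidia.com/gpu resource exposed by NVIDIA's device plugin. The image, names, and replica count are placeholders, not Dynamo's actual manifests.

```python
"""Minimal Kubernetes Deployment for a pool of GPU workers (illustrative)."""
from kubernetes import client, config

config.load_kube_config()  # assumes a reachable cluster and local kubeconfig

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="dynamo-decode-workers"),
    spec=client.V1DeploymentSpec(
        replicas=4,  # one decode worker per GPU in this toy setup
        selector=client.V1LabelSelector(match_labels={"app": "dynamo-decode"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "dynamo-decode"}),
            spec=client.V1PodSpec(containers=[
                client.V1Container(
                    name="worker",
                    image="example.com/dynamo-worker:latest",  # hypothetical image
                    # The NVIDIA device plugin exposes GPUs as a schedulable
                    # resource, so Kubernetes places each replica on a node
                    # with a free GPU.
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "1"},
                    ),
                ),
            ]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default",
                                                body=deployment)
```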
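
The disaggregated-serving bullet splits inference into two phases that can run on different GPU pools. Below is a minimal sketch of that split; all names (PrefillWorker, DecodeWorker, KVHandle) are hypothetical, introduced only for illustration, not Dynamo's real API.

```python
"""Sketch of the prefill/decode split in disaggregated serving."""
from dataclasses import dataclass


@dataclass
class KVHandle:
    """Opaque reference to a KV cache produced by prefill.

    In a real system this would point at GPU memory or a transfer
    descriptor rather than holding data inline.
    """
    request_id: str
    tokens: list[int]   # prompt tokens the cache covers
    location: str       # which prefill node holds the cache


class PrefillWorker:
    """Ingests the full prompt once and materializes its KV cache."""

    def __init__(self, node: str) -> None:
        self.node = node

    def prefill(self, request_id: str, prompt_tokens: list[int]) -> KVHandle:
        # A real implementation would run one forward pass over the whole
        # prompt on this node's GPUs; here we only record metadata.
        return KVHandle(request_id, prompt_tokens, self.node)


class DecodeWorker:
    """Generates tokens one step at a time against an imported KV cache."""

    def decode(self, handle: KVHandle, max_new_tokens: int) -> list[int]:
        # A real implementation would pull the cache from handle.location
        # (e.g. via an RDMA transfer) before stepping the model.
        generated: list[int] = []
        for step in range(max_new_tokens):
            generated.append(step)  # placeholder for a sampled token id
        return generated


# Usage: prefill on one pool of GPUs, decode on another.
handle = PrefillWorker(node="prefill-0").prefill("req-1", [101, 2009, 102])
tokens = DecodeWorker().decode(handle, max_new_tokens=8)
```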
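
The planner bullet amounts to a feedback loop: observe per-pool load, then move GPUs toward the backlog. The toy heuristic below rebalances whole GPUs in proportion to queue depth; the pool names, the load signal (queue depth), and the one-GPU-per-pool floor are assumptions, not Dynamo's planner algorithm.

```python
"""Toy GPU rebalancing heuristic (not Dynamo's planner algorithm)."""


def rebalance(queue_depth: dict[str, int], total_gpus: int) -> dict[str, int]:
    """Split `total_gpus` across pools in proportion to their queue depth.

    Every pool keeps at least one GPU so no phase starves completely.
    """
    total_depth = sum(queue_depth.values()) or 1
    alloc = {
        pool: max(1, round(total_gpus * depth / total_depth))
        for pool, depth in queue_depth.items()
    }
    # Rounding can leave the total off by a GPU or two; give the
    # difference to (or take it from) the most backlogged pool.
    busiest = max(queue_depth, key=queue_depth.get)
    alloc[busiest] += total_gpus - sum(alloc.values())
    return alloc


# Decode is heavily backlogged, so GPUs shift away from prefill.
print(rebalance({"prefill": 10, "decode": 70}, total_gpus=16))
# -> {'prefill': 2, 'decode': 14}
```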
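
For the routing bullet, the core idea is prefix reuse: send a request to the replica whose cached tokens overlap the prompt the most, so the least prefill work is repeated. The sketch below uses an exact token-prefix match for clarity; real routers typically track cache contents approximately (for example via block hashes), and all names here are illustrative.

```python
"""Sketch of KV-cache-aware (prefix-aware) request routing."""


def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the common token prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt: list[int], worker_caches: dict[str, list[int]]) -> str:
    """Choose the worker whose cache covers the longest prefix of `prompt`."""
    return max(worker_caches,
               key=lambda w: shared_prefix_len(prompt, worker_caches[w]))


caches = {
    "worker-a": [1, 2, 3, 4],     # holds KV for a shorter shared prefix
    "worker-b": [1, 2, 9],        # shares only two tokens with the prompt
    "worker-c": [1, 2, 3, 4, 5],  # longest overlap: 5 tokens reusable
}
print(route([1, 2, 3, 4, 5, 6, 7], caches))  # -> worker-c
```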
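
The KV-transfer bullet concerns where cache blocks live once GPU memory runs out. The sketch below models a three-tier hierarchy with least-recently-used demotion (GPU HBM to host DRAM to NVMe), which is one common policy for this problem; it deliberately does not use the real NIXL API, and the tier names and capacities are made up.

```python
"""Illustrative tiered KV-cache store with LRU demotion (not the NIXL API)."""
from collections import OrderedDict

# Assumed tier capacities in KV blocks, fastest tier first. The bottom
# tier is treated as effectively unbounded in this sketch.
TIERS = ["gpu_hbm", "host_dram", "nvme"]
CAPACITY = {"gpu_hbm": 4, "host_dram": 16, "nvme": 1 << 20}


class TieredKVStore:
    def __init__(self) -> None:
        # Per tier: block_id -> None, ordered by recency (LRU at the front).
        self.tiers = {t: OrderedDict() for t in TIERS}

    def put(self, block_id: str, tier: str = "gpu_hbm") -> None:
        """Place a block in `tier`, demoting the LRU block if it is full."""
        blocks = self.tiers[tier]
        if len(blocks) >= CAPACITY[tier]:
            victim, _ = blocks.popitem(last=False)  # evict least recent
            lower = TIERS[TIERS.index(tier) + 1]    # demote one tier down
            self.put(victim, lower)
        blocks[block_id] = None

    def get(self, block_id: str) -> str | None:
        """Return the tier holding the block, marking it most recently used."""
        for tier, blocks in self.tiers.items():
            if block_id in blocks:
                blocks.move_to_end(block_id)
                return tier
        return None


store = TieredKVStore()
for i in range(6):          # 6 blocks overflow the 4-block GPU tier
    store.put(f"blk-{i}")
print(store.get("blk-0"))   # -> host_dram (demoted)
print(store.get("blk-5"))   # -> gpu_hbm
```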
Requirements
- BS/MS or higher in computer engineering, computer science, or related engineering (or equivalent experience).
- 5+ years of proven experience in a related field.
- Strong proficiency in systems programming (Rust and/or C++). Experience with Python for workflow and API development. Experience with Go for Kubernetes controllers and operators is desirable.
- Deep understanding of distributed systems, parallel computing, and GPU architectures.
- Experience with cloud-native deployment and container orchestration (Kubernetes, Docker).
- Experience with large-scale inference serving, LLMs, or similar high-performance AI workloads.
- Background in memory management, data transfer optimization, and multi-node orchestration.
- Familiarity with open-source development workflows (GitHub, CI/CD).
- Excellent problem-solving and communication skills.
Ways to stand out
- Prior contributions to open-source AI inference frameworks (e.g., vLLM, TensorRT-LLM, SGLang).
- Experience with GPU resource scheduling, cache management, or high-performance networking.
- Deep understanding of LLM-specific inference challenges, such as context window scaling and multi-model agentic workflows.
Compensation & Application
- Base salary ranges (dependent on location, experience, and level):
  - Level 3: USD 148,000 - 235,750
  - Level 4: USD 184,000 - 287,500
- You will also be eligible for equity and benefits.
- Applications for this job will be accepted at least until July 29, 2025.
Company & Diversity
NVIDIA offers competitive salaries and a comprehensive benefits package. NVIDIA is an equal opportunity employer, values diversity in its workforce, and does not discriminate on the basis of protected characteristics.