Senior Software Engineer, Quantized Inference

at NVIDIA

📍 Redmond, United States

USD 152,000-287,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Machine Learning

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and you know the pitfalls and tricks.

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding this level.

Python @ 3, Communication @ 7, Data Analysis @ 4, Debugging @ 4, LLM @ 4, PyTorch @ 3, AI @ 7, vLLM @ 4, SGLang @ 4

Details

NVIDIA is seeking a Senior Software Engineer to accelerate the discovery and deployment of efficient quantized and sparse inference recipes for large language models (LLMs). A recipe defines which operators are transformed into low-precision or sparsified variants to unlock throughput and latency gains without regressing accuracy or verbosity. The work spans kernel-level and model-level implementations across inference engines, as well as collaboration with partner inference teams to optimize throughput and interactivity on target workloads.

Responsibilities

Implement quantized and sparse recipes in inference engines (vLLM, TRT-LLM, SGLang).
Translate recipe specifications into functionally correct, performant code (e.g., write Triton kernels, insert quantize/dequantize nodes into prefill and decode paths).
Ensure per-expert scaling in MoE layers is handled correctly.
Own model export pipelines (ModelOpt, Megatron-LM <-> HuggingFace) to ensure quantized checkpoints serialize correctly for downstream serving.
Build prototypes and benchmarking harnesses to evaluate recipe throughput and interactivity before full optimization.
Develop data analysis tooling and visualizations for numerics debugging.
Improve developer productivity across the team (CI, build systems, training infrastructure, pipeline friction).
Participate in code reviews and incorporate feedback.

Requirements

Proficient in Python; familiarity with C++.
Strong software engineering fundamentals: writes concise, well-tested code and is fluent with AI-assisted tooling.
Experience with ML accelerators and a basic understanding of how certain ML layers affect execution time.
Familiarity with PyTorch internals (custom ops, autograd, export) or equivalent framework internals.
Experience reading, modifying, or contributing to a large open-source codebase.
MS/PhD in Computer Science or related field, or equivalent experience.
4+ years in a relevant software engineering role.
Demonstrated ability to move fast with ambiguous requirements, with strong written and verbal communication.

Ways to stand out

Experience contributing to inference serving frameworks (vLLM, TRT-LLM, SGLang) or Triton kernel development.
Track record of debugging numerical issues across mixed-precision boundaries.
Deep experience with model compression techniques: PTQ, QAT, structured/unstructured sparsity.

Compensation & Benefits

Base salary ranges: USD 152,000-241,500 for Level 3; USD 184,000-287,500 for Level 4.
Eligible for equity and company benefits (the original posting links to NVIDIA's benefits page).

Other information

Applications are accepted at least until March 1, 2026.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer committed to diversity.