Senior Software Engineer, AI Inference

at NVIDIA
📍 Toronto, Canada
CAD 135,000-220,000 per year
✅ Hybrid

Used Tools & Technologies

Machine Learning

Required Skills & Competences

Kubernetes @ 4, Communication @ 7, Debugging @ 4, OSS @ 4, LLM @ 4, GPU @ 4, AI @ 7, Profiling @ 4, vLLM @ 4, Slurm @ 4, SGLang @ 4, HPC @ 6, Performance Analysis @ 3

Details

Help push the boundaries of AI inference at NVIDIA by combining deep systems knowledge with hands-on customer engagement. You will profile real deployments, benchmark across GPU clusters, and turn insights into improvements that benefit customers and open-source projects such as vLLM.

Responsibilities

  • Partner directly with customer engineering teams through long-term technical engagements to understand LLM serving architectures and performance goals.
  • Design and implement end-to-end benchmarking campaigns across Kubernetes and Slurm environments to surface actionable insights.
  • Set up and operate vLLM serving deployments on GPU clusters; tune configurations for throughput, latency, and efficiency.
  • Collect Nsight Systems and Nsight Compute profiling traces to identify performance gaps relative to reference frameworks.
  • Develop detailed performance plans based on profiling findings and collaborate with NVIDIA kernel engineering and OSS vLLM teams to drive improvements.
  • Build internal tools, benchmarking harnesses, and automation pipelines to raise team and customer productivity.
  • Document architectures, findings, and recommendations for technical audiences and contribute improvements back to vLLM and related open-source projects.

Requirements

  • Bachelor’s, Master’s, or PhD in Computer Science or Computer Engineering, or equivalent experience.
  • 5+ years of industry experience building and operating complex, production-grade software systems, with strong instincts for how systems behave at scale.
  • Hands-on experience deploying and operating LLM inference workloads, particularly with vLLM, including configuration, optimization, and debugging in real environments.
  • Proficiency with container orchestration (Kubernetes) and HPC scheduling (Slurm) for GPU-accelerated workloads.
  • Solid understanding of LLM serving fundamentals: batching strategies (continuous batching, chunked prefill), KV cache management, and tensor/pipeline parallelism.
  • Familiarity with GPU performance analysis: memory hierarchy, utilization, roofline modeling, and profiling with Nsight Systems or Nsight Compute.
  • Strong written and verbal communication skills; ability to present technical findings clearly and navigate ambiguous, open-ended customer problems.

Ways to Stand Out

  • Experience with NVIDIA Dynamo or other disaggregated inference serving frameworks.
  • Contributions to open-source inference or ML systems projects (particularly vLLM or SGLang).
  • Background with ML compilers or GPU kernel development (Triton, CUTLASS, TorchInductor).
  • Experience building developer tools or internal platforms that improved team productivity.
  • Prior experience in a customer-facing or forward-deployed engineering capacity within a technical product organization.

Additional Information

  • Applications will be accepted at least until April 14, 2026.
  • This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes.