Required Skills & Competences
Docker @ 3, Go @ 1, Kubernetes @ 3, Linux @ 3, Python @ 1, GCP @ 4, GitHub @ 4, CI/CD @ 4, Algorithms @ 4, Data Structures @ 4, Distributed Systems @ 4, AWS @ 4, Azure @ 4, Communication @ 4, Parallel Programming @ 4, Rust @ 1, Debugging @ 4, LLM @ 4, PyTorch @ 4, CUDA @ 3, GPU @ 4
Details
We are seeking highly skilled and motivated software engineers to build AI inference systems that serve large-scale models with extreme efficiency. You will architect and implement high-performance inference stacks, optimize GPU kernels and compilers, drive industry benchmarks, and scale workloads across multi-GPU, multi-node, and multi-cloud environments. You will collaborate across inference, compiler, scheduling, and performance teams to push the frontier of accelerated computing for AI.
Responsibilities
- Contribute features to vLLM that support the newest models and the latest NVIDIA GPU hardware capabilities.
- Profile and optimize the inference framework (vLLM) using methods such as speculative decoding; data, tensor, expert, and pipeline parallelism; and prefill-decode disaggregation (a minimal vLLM sketch follows this list).
- Develop, optimize, and benchmark GPU kernels (hand-tuned and compiler-generated) using techniques such as fusion, autotuning, and memory/layout optimization.
- Build and extend high-level DSLs and compiler infrastructure to boost kernel developer productivity and approach peak hardware utilization.
- Define and build inference benchmarking methodologies and tools; contribute new benchmarks and NVIDIA submissions to the MLPerf Inference benchmarking suite.
- Architect scheduling and orchestration of containerized large-scale inference deployments on GPU clusters across clouds.
- Conduct and publish original research that advances ML systems, and integrate research prototypes into NVIDIA software products.
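For context on the kind of work the second bullet describes, here is a minimal, hedged sketch of driving vLLM's offline Python API with tensor parallelism. The model name, prompt, and parallelism degree are illustrative assumptions, not project specifics.

```python
# Minimal vLLM offline-inference sketch (assumes `pip install vllm` and 2 GPUs).
# The model name and tensor_parallel_size are illustrative placeholders.
from vllm import LLM, SamplingParams

prompts = ["Explain KV-cache paging in one sentence."]
sampling = SamplingParams(temperature=0.7, max_tokens=64)

# tensor_parallel_size shards the model's weights across 2 GPUs.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```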
Requirements
- Bachelor’s degree (or equivalent experience) in Computer Science, Computer Engineering, or Software Engineering with 7+ years of experience; or Master’s degree with 5+ years; or PhD with thesis and top-tier publications in ML Systems, GPU architecture, or high-performance computing.
- Strong programming skills in Python and C/C++; experience with Go or Rust is a plus.
- Solid CS fundamentals: algorithms & data structures, operating systems, computer architecture, parallel programming, distributed systems, and deep learning theory.
- Experience in performance engineering for ML frameworks (e.g., PyTorch) and inference engines (e.g., vLLM, SGLang).
- Familiarity with GPU programming and performance: CUDA, memory hierarchy, streams, NCCL; proficiency with profiling/debugging tools (e.g., Nsight Systems/Compute); a lightweight profiling sketch follows this list.
- Experience with containers and orchestration (Docker, Kubernetes, Slurm); familiarity with Linux namespaces and cgroups.
- Excellent debugging, problem-solving, and communication skills; ability to excel in a fast-paced, cross-functional environment.
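Nsight Systems/Compute are the profilers named above; as a lightweight, framework-level illustration of the same workflow, here is a hedged sketch using PyTorch's built-in profiler. The matmul workload and tensor shapes are arbitrary assumptions standing in for a real inference kernel.

```python
# Hedged sketch: profiling a GPU op with torch.profiler (assumes PyTorch + a CUDA GPU).
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(4096, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = x @ x  # arbitrary workload standing in for an inference kernel
    torch.cuda.synchronize()

# Rank ops by GPU time to spot optimization targets.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```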
Ways to Stand Out
- Experience building and optimizing LLM inference engines (e.g., vLLM, SGLang).
- Hands-on work with ML compilers and DSLs (e.g., Triton, TorchDynamo/Inductor, MLIR/LLVM, XLA), GPU libraries (e.g., CUTLASS), and hardware features (e.g., CUDA Graphs, Tensor Cores); a short Triton sketch follows this list.
- Experience contributing to containerization/virtualization technologies such as containerd, CRI-O, or CRIU.
- Experience with cloud platforms (AWS, GCP, Azure), infrastructure-as-code, CI/CD, and production observability.
- Contributions to open-source projects and/or publications (include links to GitHub PRs, papers, artifacts).
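To illustrate the DSL and fusion work named above, here is a minimal, hedged sketch of a fused add+ReLU kernel in Triton. The kernel name, tensor sizes, and BLOCK_SIZE are arbitrary assumptions chosen for the example.

```python
# Hedged sketch: a fused elementwise add + ReLU kernel in Triton
# (assumes `pip install triton` and a CUDA GPU; shapes/BLOCK_SIZE are arbitrary).
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fusion: add and ReLU in one pass, avoiding an intermediate round-trip to HBM.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
fused_add_relu[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, torch.relu(x + y))
```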
Compensation & Benefits
- Base salary is location- and experience-dependent. Ranges by level:
  - Level 4: 142,500 CAD - 247,000 CAD
  - Level 5: 183,750 CAD - 318,500 CAD
- Eligible for equity and company benefits.
Additional Information
- Hybrid role (#LI-Hybrid).
- Applications accepted at least until November 24, 2025.