Used Tools & Technologies
Not specified
Required Skills & Competences
Go @ 6, Kubernetes @ 3, Linux @ 5, Python @ 6, Algorithms @ 3, Data Structures @ 3, Rust @ 6, CUDA @ 3, GPU @ 3
Details
We are seeking highly motivated and skilled systems engineers to join a team developing an AI platform that provides efficient infrastructure for inference and training of large-scale models. The role focuses on building a unified solution that integrates NVIDIA technologies (high-performance inference/training frameworks, ML compilers, performance predictors, and cluster schedulers) into a cohesive platform.
Responsibilities
- Participate in the development of an AI platform for training, fine-tuning, and serving state-of-the-art AI models with optimal performance and efficiency.
- Design and build solutions for scheduling large-scale AI training and inference workloads on GPU clusters across multiple cloud infrastructures.
- Explore and find solutions to open problems such as industry-scale resource management, GPU scheduling, performance prediction, and live workload migration.
- Collaborate with adjacent teams and contribute to their components, including the TensorRT/Dynamo inference engine, ML compilers, the KAI/Grove scheduler, and Lepton cloud.
Requirements
- Bachelor's degree or equivalent experience in Computer Science, Computer Engineering, or a relevant technical field.
- 5+ years of experience.
- Experience building large-scale systems from scratch; prior experience with container-based deployment systems (e.g., Kubernetes) is beneficial.
- Strong coding skills in one or more of: Python, Go, Rust, and/or C/C++.
- Solid foundation in algorithms and data structures, operating systems, and computer architecture.
- Strong understanding of AI and related technologies is a plus.
- Ability to quickly grasp new concepts and thrive in evolving situations.
Preferred / Ways to stand out
- Graduate-level education or relevant practical research background.
- Practical experience building and optimizing AI applications.
- Proficiency with container technologies such as containerd, CRI-O, Linux namespaces, and CRIU.
- Experience with NVIDIA GPU technologies such as CUDA graphs and driver/runtime internals.
Benefits
- Base salary range: 116,250 CAD - 201,500 CAD (determined based on location, experience, and comparable roles).
- Eligible for equity and additional benefits (see company benefits page).
- Applications for this job will be accepted at least until September 6, 2025.