Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by the "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and you know the pitfalls and tricks;
- 10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Kubernetes @ 4
Linux @ 4
CI/CD @ 4
Distributed Systems @ 4
gRPC @ 4
Debugging @ 7
API @ 4
Reporting @ 4
AI @ 4
vLLM @ 4
Details
NVIDIA’s LPU System Software team builds foundational software that enables deterministic, high-performance computing platforms by shifting complexity from silicon into software. The team designs and maintains hardware abstraction layers, core system libraries, and runtime components that allow compiler teams and data center operators to safely and efficiently execute workloads on novel architectures. In this role, you will develop and evolve the libraries, drivers, and runtime interfaces that serve as key entry points into the platform, and you will help improve reliability and operability through automation, diagnostics, and tight cross-org collaboration with hardware, compiler, and operations teams.
Responsibilities
- Extend and maintain hardware abstraction layers and core system libraries used across the platform.
- Design and implement drivers, runtimes, and data movement/aggregation pipelines supporting workload execution.
- Build and maintain runtime interfaces for launching, monitoring, and managing workloads.
- Improve platform reliability through automation, error reporting, diagnostics, and operational tooling.
- Debug and resolve complex sequencing, initialization, and runtime issues across multi-component systems.
- Partner cross-functionally with hardware engineering, compiler teams, and data center operations to bring features from prototype to production.
- Support new platform bring-up and NPI (New Product Introduction) efforts for new boards and silicon.
- Contribute to engineering excellence through documentation, tooling improvements, code reviews, and knowledge sharing.
Requirements
- Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related STEM field, or equivalent experience.
- 5+ years of relevant work experience.
- Strong proficiency in modern C++ (design, implementation, debugging, and performance considerations).
- Experience designing, maintaining, and refactoring software libraries and APIs with long-term support in mind.
- Comfortable working in large, multi-repository or multi-component codebases with layered dependencies.
- Demonstrated ability to lead or drive triage of difficult reliability issues and produce clear root-cause analysis.
- Ability to clearly communicate software architecture and design tradeoffs, including using diagrams and written design docs.
- Low-level platform software experience (e.g., firmware/boot flows, RTOS, BMCs/MCUs, RISC-V, or closely related system software).
- Linux systems experience including driver or kernel-adjacent interfaces (e.g., VFIO or similar subsystems).
- Hardware bring-up and/or system triage experience (fault analysis, system diagnostics, or validation support in lab environments).
Ways to stand out
- Distributed systems experience (e.g., MPI, gRPC, RPC frameworks, coordination/telemetry patterns).
- Experience with inference systems and token serving (e.g., vLLM or similar serving/runtime stacks).
- Experience shipping and supporting customer-facing SDKs, including documentation and ABI compatibility practices.
- Production readiness and delivery experience (e.g., CI/CD and release workflows, monitoring/alerting practices, Kubernetes and/or data center operational workflows).
Compensation & Benefits
- Base salary ranges depend on location and level: 135,000 CAD to 185,000 CAD for Level 3, and 170,000 CAD to 220,000 CAD for Level 4.
- You will also be eligible for equity and benefits (see https://www.nvidia.com/en-us/benefits/).
Additional details
- Applications for this job will be accepted at least until March 22, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- Location: Toronto, Canada. Employment type: Full-time.