Used Tools & Technologies
Machine Learning, HPC
Required Skills & Competences
Each tag name is followed by the "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 4
Go @ 4
Linux @ 6
Hiring @ 4
Bash @ 4
Networking @ 4
Debugging @ 4
GPU @ 4
AI @ 4
Details
Nebius is building a full-stack AI cloud platform for large-scale GPU orchestration, inference optimization, and AI/ML infrastructure. The team works across compute, storage, networking and applied AI, with R&D hubs across Europe, the UK, North America and Israel.
Responsibilities
- Design and implement embedded firmware for server management, telemetry, and control systems.
- Maintain and enhance custom OpenBMC firmware with new features and improvements.
- Enable real-time monitoring of power, thermal sensors, and hardware health.
- Work closely with hardware engineers to validate firmware for existing and future platforms.
- Debug and optimize low-level drivers and protocols.
- Contribute to long-term firmware architecture for GPU cluster reliability.
Requirements
- 5+ years in embedded systems or firmware development.
- Proficiency in embedded Linux.
- Hands-on experience with BMCs, microcontrollers, or SoC firmware.
- Understanding of hardware bring-up and debugging.
- Languages: C, C++, Bash, Go, YAML.
- Firmware: OpenBMC, U-Boot, Linux Kernel.
- Interfaces: I2C, I3C, SPI, eSPI, UART, LPC.
- Protocols: SMBus, PCIe, PMBus, PECI.
- Build systems: Meson, CMake.
- Descriptors & formats: FRU, SMBIOS, ACPI, DMI.
Preferred
- Knowledge of Yocto Project principles.
- Knowledge of systemd and D-Bus principles.
- Proficiency in C++ and strong C skills for Linux drivers and U-Boot.
- Experience developing Linux drivers (sysfs, hwmon interfaces).
- Experience with server BMC firmware (IPMI, IPMB, KCS, SSIF, Redfish, PLDM).
- Knowledge of GPU/CPU telemetry frameworks (e.g., NVML, DCGM).
- Exposure to firmware security (Secure Boot, signed firmware).
- Experience with RAS (Reliability, Availability, Serviceability).
- Background in high-performance computing or data center hardware.
Compensation
- Base compensation range: $179,500 — $269,200 USD. Actual compensation will be determined based on experience, skills, qualifications, hiring level, and geographic location.
Benefits & Perks
- Competitive compensation
- Career growth and learning opportunities
- Flexibility and ownership
- Collaborative and innovative culture
- Opportunity to work on impactful AI projects
- International environment and talented teams
Work Authorization
- Applicants must be authorized to work in the country in which they apply and will be required to provide proof of employment eligibility as a condition of hire.