Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — expert. You can teach others and know the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Linux @ 3
Networking @ 2
GPU @ 3
AI @ 3
Details
Nebius is building a full-stack AI cloud platform and operates production data centers globally. This role owns advanced hardware troubleshooting and RMA lifecycle management within production data center environments, serving as the escalation point for complex server and firmware-related issues that impact system reliability and fleet availability.
Responsibilities
- Perform advanced firmware and hardware diagnostics on enterprise server platforms, including CPU, memory, PCIe devices, GPUs, storage subsystems, and power components.
- Troubleshoot complex hardware failures using system logs, BMC/IPMI interfaces, BIOS diagnostics, and vendor-specific tooling.
- Act as the primary escalation point for L1 and L2 technicians on high-impact hardware incidents.
- Conduct structured root cause analysis and document findings to prevent repeat failures.
- Own the full RMA lifecycle, including validation of failed components, warranty claim creation, vendor coordination, tracking, and resolution.
- Interface directly with OEM vendors to escalate recurring hardware defects and drive corrective action.
- Analyze hardware failure trends and report metrics such as repeat RMA rates and component reliability.
- Develop and standardize diagnostic playbooks, troubleshooting workflows, and hardware validation procedures.
- Validate replacement components prior to redeployment into production environments.
- Collaborate cross-functionally with data center operations, procurement, and engineering teams to improve hardware lifecycle processes.
- Contribute to reducing MTTR and improving fleet-wide reliability through process improvements and knowledge sharing.
Requirements
- 5+ years of hands-on experience working with enterprise server hardware in a production data center environment.
- Deep understanding of x86 server architecture, including CPUs, memory, PCIe devices, storage controllers, GPUs, and power subsystems.
- Strong experience performing firmware and BIOS/BMC diagnostics and upgrades.
- Advanced Linux command-line troubleshooting skills, including log analysis and hardware-level diagnostics.
- Experience working with remote management interfaces such as IPMI, iDRAC, iLO, or equivalent.
- Proven experience managing hardware RMA processes and working directly with OEM vendors.
- Ability to conduct structured root cause analysis and document technical findings clearly.
- Familiarity with hardware monitoring systems and failure trend analysis.
- Strong ownership mindset and ability to operate independently in mission-critical environments.
- High proficiency in spoken and written English.
- Valid driver’s license.
Nice to have
- Experience performing board-level diagnostics and component-level repair (SMD rework).
- Familiarity with data center networking equipment and basic network troubleshooting.
- Experience supporting GPU-dense or high-performance compute environments.
Compensation
- Competitive salary: $76,800 - $184,300 OTE (on-target earnings), based on experience and skills.
Benefits
- Competitive compensation
- Career growth and learning opportunities
- Flexibility and work-life balance
- Collaborative and innovative culture
- Opportunity to work on impactful AI projects
- International environment and talented teams
Other details
- Work location: on-site at one of Nebius' data centers in Missouri; a Minnesota location is also listed as available. Applicants must be authorized to work in the country in which they apply and will be required to provide proof of employment eligibility as a condition of hire.