Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Linux @ 5
Python @ 5
Bash @ 5
Networking @ 3
Performance Optimization @ 3
Debugging @ 3
Cloud Computing @ 3
GPU @ 3
AI @ 3
InfiniBand @ 3
Details
Nebius is leading a new era in cloud computing to serve the global AI economy. We create the tools and resources our customers need to solve real-world challenges and transform industries, without massive infrastructure costs or the need to build large in-house AI/ML teams. Our employees work at the cutting edge of AI cloud infrastructure alongside experienced leaders and engineers.
Where we work: Headquartered in Amsterdam and listed on Nasdaq, Nebius has a global footprint with R&D hubs across Europe, North America, and Israel. The team of over 800 employees includes more than 400 engineers with expertise across hardware and software engineering, as well as an in-house AI R&D team.
Role overview
Nebius is looking for a System Engineer (Servers Hardware R&D Team) to support expanding North American operations. This position requires occasional on-site presence at data center locations. The role focuses on design, deployment, testing, troubleshooting, and performance optimization of high-performance, GPU-based cloud systems for AI workloads.
Responsibilities
- Participate in the design, deployment, and maintenance of high-performance cloud systems optimized for AI workloads.
- Arrange and perform hardware R&D tests and experiments on-site in data center environments.
- Troubleshoot and resolve complex system issues related to GPUs, networking (InfiniBand, NVLink), PCIe, and server infrastructure.
- Conduct deep investigations into hardware, software, and networking issues to ensure optimal system performance and reliability.
- Develop and execute test plans and methodologies for advanced GPU, InfiniBand, and compute systems to benchmark and validate performance.
- Collaborate closely with cross-functional engineering and operations teams to improve system performance and reliability.
- Monitor system performance and continuously fine-tune configurations for maximum efficiency.
Requirements
- Strong knowledge of modern server architecture, particularly in high-performance, GPU-based environments.
- Hands-on experience with GPUs, networking, NVLink, and PCIe technologies.
- Proficiency in Linux systems, with experience using Python and Bash for automation and tooling.
- Demonstrated ability to troubleshoot complex hardware, software, and networking issues.
- Experience with deep problem investigation, root cause analysis, and performance optimization in cloud or high-performance computing environments.
- Strong analytical and problem-solving skills with a performance-first mindset.
- Basic electronics modification skills, including soldering and wiring.
Nice to have
- Knowledge of the Linux kernel and experience with kernel-level debugging or troubleshooting.
- Familiarity with electronic measurement equipment such as oscilloscopes and multimeters.
Benefits
- Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
- 401(k) plan: up to 4% company match with immediate vesting.
- Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
- Remote work reimbursement: up to $85/month for mobile and internet.
- Disability & life insurance: company-paid short-term disability, long-term disability, and life insurance coverage.
- Competitive salary and comprehensive benefits package, opportunities for professional growth, flexible working arrangements, and a dynamic collaborative environment.
Compensation
- Base salary range: $150,000 - $200,000 per year, plus quarterly performance bonuses.