Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — expert. Able to teach others and aware of all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Linux @ 4
Leadership @ 4
Communication @ 4
Networking @ 4
Technical Leadership @ 4
Cloud Computing @ 4
GPU @ 4
AI @ 4
HPC @ 4
Details
Nebius is leading a new era in cloud computing to serve the global AI economy. They build GPU-dense AI infrastructure and tools to help customers run AI/ML workloads without massive infrastructure costs. The company is headquartered in Amsterdam with R&D hubs across Europe, North America, and Israel and a team of over 800 employees.
This role sits at the intersection of hardware, Linux systems, and operational execution. The position is responsible for the end-to-end delivery, deployment, and production readiness of next-generation GPU platforms inside data centers, GB-series infrastructure in particular. The work includes on-site rack bring-up, validation of NVIDIA-based AI systems, and coordination of repairs.
Responsibilities
- Lead end-to-end deployment of GB-series racks within data center environments
- Oversee installation, bring-up, validation, and production readiness of NVIDIA H200 and B200-based servers
- Troubleshoot complex hardware, firmware, Linux OS, and networking issues
- Execute structured testing and validation procedures during deployment
- Develop and maintain basic Linux-based hardware health-check and diagnostic scripts
- Coordinate on-site hardware repairs, part replacements, and vendor escalations
- Drive root cause analysis and ensure corrective actions are implemented
- Manage and prioritize deployment timelines across multiple concurrent rollouts
- Provide technical leadership and guidance to on-site engineers and technicians
- Partner with networking and infrastructure teams to ensure seamless integration
- Document deployment processes, validation standards, and operational runbooks
Requirements
- Strong hands-on experience deploying and operating data center infrastructure
- Deep familiarity with GPU-dense systems, ideally NVIDIA Hopper and Blackwell platforms (H200/B200)
- Experience working with high-density rack deployments (GB-series or similar)
- Solid Linux experience, including troubleshooting and scripting
- Ability to diagnose issues across hardware, OS, firmware, and network layers
- Experience coordinating field repairs and working directly with hardware vendors
- Proven experience leading technical teams or overseeing field operations
- High ownership mindset and ability to operate in production-critical environments
- Clear communication skills and ability to collaborate across distributed teams
Nice to have
- Experience deploying AI or HPC clusters at scale
- Familiarity with automated provisioning or infrastructure lifecycle systems
- Background in hardware qualification, burn-in testing, or factory validation
- Experience supporting rapid infrastructure expansion
- Exposure to ARM-based or heterogeneous compute environments
Working conditions
- Collaboration with globally distributed engineering and operations teams
- Role requires on-site deployments in data center environments (field travel and on-site presence expected)
Benefits
- Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families
- 401(k) plan: up to 4% company match with immediate vesting
- Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers
- Remote work reimbursement: up to $85/month for mobile and internet
- Disability & life insurance: company-paid short-term, long-term, and life insurance coverage
- Competitive salary and comprehensive benefits package, opportunities for professional growth, and flexible working arrangements
Compensation
- Base salary range: $125,000 - $180,000 per year, plus quarterly performance bonuses.