Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — expert. Able to teach others and familiar with all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Linux @ 4
Leadership @ 4
Communication @ 4
Networking @ 4
Technical Leadership @ 4
Cloud Computing @ 4
GPU @ 4
AI @ 4
HPC @ 4
Details
Nebius is leading a new era in cloud computing to serve the global AI economy. We create tools and resources to help customers solve real-world challenges without massive infrastructure costs or large in-house AI/ML teams. The company is headquartered in Amsterdam, listed on Nasdaq, and has R&D hubs across Europe, North America, and Israel.
This role sits at the intersection of hardware, Linux systems, and operational execution. You will own end-to-end delivery, deployment, and production readiness of next-generation GPU platforms inside data centers. That means leading on-site rack bring-up, validating NVIDIA-based AI systems, coordinating repairs, and ensuring GB-series infrastructure moves from installation to fully operational production environments.
Responsibilities
- Lead end-to-end deployment of GB-series racks within data center environments
- Oversee installation, bring-up, validation, and production readiness of NVIDIA H200- and B200-based servers
- Troubleshoot complex hardware, firmware, Linux OS, and networking issues
- Execute structured testing and validation procedures during deployment
- Develop and maintain basic Linux-based hardware health-check and diagnostic scripts
- Coordinate on-site hardware repairs, part replacements, and vendor escalations
- Drive root cause analysis and ensure corrective actions are implemented
- Manage and prioritize deployment timelines across multiple concurrent rollouts
- Provide technical leadership and guidance to on-site engineers and technicians
- Partner with networking and infrastructure teams to ensure seamless integration
- Document deployment processes, validation standards, and operational runbooks
Requirements
- Strong hands-on experience deploying and operating data center infrastructure
- Deep familiarity with GPU-dense systems, ideally NVIDIA Hopper or Blackwell platforms (H200, B200)
- Experience working with high-density rack deployments (GB-series or similar)
- Solid Linux experience, including troubleshooting and scripting
- Ability to diagnose issues across hardware, OS, firmware, and network layers
- Experience coordinating field repairs and working directly with hardware vendors
- Proven experience leading technical teams or overseeing field operations
- High ownership mindset and ability to operate in production-critical environments
- Clear communication skills and ability to collaborate across distributed teams
Nice to have
- Experience deploying AI or HPC clusters at scale
- Familiarity with automated provisioning or infrastructure lifecycle systems
- Background in hardware qualification, burn-in testing, or factory validation
- Experience supporting rapid infrastructure expansion
- Exposure to ARM-based or heterogeneous compute environments
Working conditions
- Collaboration with globally distributed engineering and operations teams
- On-site deployment and field work in mission-critical data center environments
Benefits
- 100% company-paid medical, dental, and vision coverage for employees and families
- 401(k) plan: up to 4% company match with immediate vesting
- Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers
- Remote work reimbursement: up to $85/month for mobile and internet
- Company-paid short-term and long-term disability insurance and life insurance
- Competitive salary and comprehensive benefits package
- Opportunities for professional growth and flexible working arrangements
Compensation
- Base salary range: $125k - $180k, plus quarterly performance bonuses.