Used Tools & Technologies
Machine Learning
Required Skills & Competences
Tag name is followed by "@" symbol and proficiency level value.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and you know the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 3
Linux @ 2
SQL @ 5
Leadership @ 3
Communication @ 3
Jira @ 5
ServiceNow @ 5
QA @ 3
Engineering Management @ 3
Compliance @ 3
Cloud Computing @ 3
AI @ 3
Change Management @ 3
Details
Nebius is leading a new era in cloud computing to serve the global AI economy. We create tools and resources for customers to solve real-world challenges and transform industries without massive infrastructure costs or large in-house AI/ML teams. Headquartered in Amsterdam and listed on Nasdaq, Nebius has a global footprint with R&D hubs across Europe, North America, and Israel. The team includes more than 400 engineers and an in-house AI R&D team.
Responsibilities
- Own the site 24/7: deliver continuous availability across power, cooling, structured cabling, network, security, and DCIM—meeting or beating global SLAs.
- Build and lead the team: hire, mentor, and develop managers/technicians; run staffing models, shift coverage, and on-call rotations that scale.
- Be the incident commander: lead major events end-to-end—triage, communications, executive briefings, RCA, and durable corrective actions.
- Drive reliability engineering: implement RCM, predictive maintenance, QA/QC, 5S, and Lean/continuous improvement to cut MTTR and raise MTBF.
- Deliver capacity on time: plan and execute expansions/retrofits; commission MEP systems with Design/Construction; achieve flawless change control (MOP/SOP/EOP).
- Scale tooling & automation: mature DCIM/BMS/EPMS, monitoring/alerting, work management (Jira/ServiceNow), knowledge base (Confluence), and light scripting/SQL for telemetry and workflow automation.
- Run a metrics-first operation: publish dashboards and KPIs (availability, PUE, MTBF/MTTR, work compliance, safety) and use them to drive decisions.
- Partner across functions: work with Cloud/Compute, Network, Security, and Capacity Planning to optimize performance, cost, and resiliency across the fleet.
- Manage vendors & colos: own contracts, SLAs, and execution for rack deliveries, PDUs, fiber/copper, and lifecycle PMs; validate colo topology and compliance.
- Raise the safety bar: enforce a zero-injury EHS culture; conduct drills/audits for life safety, physical security, and data protection.
- Forecast and budget: build data-backed plans for power, spares, headcount, and projects; track OpEx/CapEx with rigor.
Requirements
- Associate's degree or trade certification in Electrical/Mechanical/Industrial Engineering (or equivalent experience).
- 10+ years in electrical/mechanical/HVAC/controls within industrial/commercial settings, 5+ years specifically in data center or mission-critical facilities.
- Team leadership experience in 24/7 sites (managing leads/techs, vendors, and on-call operations).
- Deep, hands-on knowledge of UPS/generators/switchgear, chillers/CRAC/CRAH, fire detection/suppression, BMS/EPMS/DCIM, and structured cabling (copper & fiber).
- Proven strength in incident management, RCA/Corrective Actions, change management, and vendor/contract oversight.
- Data-driven mindset with the ability to forecast resources and make analytics-backed decisions (Excel; SQL/scripting a plus).
- Excellent written/verbal communication with comfort presenting to executives and guiding field teams during live events.
- Ability to travel up to ~25% and support after-hours escalations when needed.
Nice to have
- Bachelor's degree in Electrical/Mechanical/Industrial Engineering, Engineering Management, or Reliability Engineering.
- Hyperscale/colo experience with reliability-centered maintenance, predictive analytics, and Lean/Six Sigma practices.
- Familiarity with Linux fundamentals, network equipment installation/troubleshooting, and fiber optics testing.
- Experience with Jira, Confluence, ServiceNow (or similar); strong SOP/MOP/EOP authorship.
- Certifications such as CDCP, DCM, PMP, OSHA-30, ITIL, or Uptime-aligned credentials.
Benefits
- Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
- 401(k) plan: up to 4% company match with immediate vesting.
- Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
- Remote work reimbursement: up to $85/month for mobile and internet.
- Disability & life insurance: company-paid short-term disability, long-term disability, and life insurance coverage.
Compensation
We offer a competitive base salary of $90k-$140k plus quarterly performance bonuses.