Data Center Site Manager

at Nebius
USD 90,000-140,000 per year
MIDDLE
✅ On-site

Used Tools & Technologies

Machine Learning

Required Skills & Competences

Security @ 3, Linux @ 2, SQL @ 5, Leadership @ 3, Communication @ 3, Jira @ 5, ServiceNow @ 5, QA @ 3, Engineering Management @ 3, Compliance @ 3, Cloud Computing @ 3, AI @ 3, Change Management @ 3

Details

Nebius is leading a new era in cloud computing to serve the global AI economy. We create tools and resources for customers to solve real-world challenges and transform industries without massive infrastructure costs or large in-house AI/ML teams. Headquartered in Amsterdam and listed on Nasdaq, Nebius has a global footprint with R&D hubs across Europe, North America, and Israel. The team includes more than 400 engineers and an in-house AI R&D team.

Responsibilities

  • Own the site 24/7: deliver continuous availability across power, cooling, structured cabling, network, security, and DCIM—meeting or beating global SLAs.
  • Build and lead the team: hire, mentor, and develop managers/technicians; run staffing models, shift coverage, and on-call rotations that scale.
  • Be the incident commander: lead major events end-to-end—triage, communications, executive briefings, RCA, and durable corrective actions.
  • Drive reliability engineering: implement RCM, predictive maintenance, QA/QC, 5S, and Lean/continuous improvement to cut MTTR and raise MTBF.
  • Deliver capacity on time: plan and execute expansions/retrofits; commission MEP systems with Design/Construction; achieve flawless change control (MOP/SOP/EOP).
  • Scale tooling & automation: mature DCIM/BMS/EPMS, monitoring/alerting, work management (Jira/ServiceNow), knowledge base (Confluence), and light scripting/SQL for telemetry and workflow automation.
  • Run a metrics-first operation: publish dashboards and KPIs (availability, PUE, MTBF/MTTR, work compliance, safety) and use them to drive decisions.
  • Partner across functions: work with Cloud/Compute, Network, Security, and Capacity Planning to optimize performance, cost, and resiliency across the fleet.
  • Manage vendors & colos: own contracts, SLAs, and execution for rack deliveries, PDUs, fiber/copper, and lifecycle PMs; validate colo topology and compliance.
  • Raise the safety bar: enforce a zero-injury EHS culture; conduct drills/audits for life safety, physical security, and data protection.
  • Forecast and budget: build data-backed plans for power, spares, headcount, and projects; track OpEx/CapEx with rigor.

Requirements

  • Associate's degree or trade certification in Electrical/Mechanical/Industrial Engineering (or equivalent experience).
  • 10+ years in electrical/mechanical/HVAC/controls within industrial/commercial settings, including 5+ years specifically in data center or mission-critical facilities.
  • Team leadership experience in 24/7 sites (managing leads/techs, vendors, and on-call operations).
  • Deep, hands-on knowledge of UPS/generators/switchgear, chillers/CRAC/CRAH, fire detection/suppression, BMS/EPMS/DCIM, and structured cabling (copper & fiber).
  • Proven strength in incident management, RCA/Corrective Actions, change management, and vendor/contract oversight.
  • Data-driven mindset with the ability to forecast resources and make analytics-backed decisions (Excel; SQL/scripting a plus).
  • Excellent written/verbal communication with comfort presenting to executives and guiding field teams during live events.
  • Ability to travel roughly 25% of the time and support after-hours escalations when needed.

Nice to have

  • Bachelor's degree in Electrical/Mechanical/Industrial Engineering, Engineering Management, or Reliability Engineering.
  • Hyperscale/colo experience with reliability-centered maintenance, predictive analytics, and Lean/Six Sigma practices.
  • Familiarity with Linux fundamentals, network equipment installation/troubleshooting, and fiber optics testing.
  • Experience with Jira, Confluence, ServiceNow (or similar); strong SOP/MOP/EOP authorship.
  • Certifications such as CDCP, DCM, PMP, OSHA-30, ITIL, or Uptime-aligned credentials.

Benefits

  • Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
  • 401(k) plan: up to 4% company match with immediate vesting.
  • Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
  • Remote work reimbursement: up to $85/month for mobile and internet.
  • Disability & life insurance: company-paid short-term disability, long-term disability, and life insurance coverage.

Compensation

We offer a competitive base salary of $90k to $140k, plus quarterly performance bonuses.