Data Center Site Manager

at Nebius

📍 Philadelphia, United States

USD 90,000-140,000 per year

MIDDLE

✅ On-site

Used Tools & Technologies

Machine Learning

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Security @ 3
Linux @ 2
SQL @ 5
Leadership @ 3
Communication @ 3
Jira @ 5
ServiceNow @ 5
QA @ 3
Engineering Management @ 3
Compliance @ 3
Cloud Computing @ 3
AI @ 3
Change Management @ 3

Details

Nebius is leading a new era in cloud computing to serve the global AI economy. We create tools and resources for customers to solve real-world challenges and transform industries without massive infrastructure costs or large in-house AI/ML teams. Headquartered in Amsterdam and listed on Nasdaq, Nebius has a global footprint with R&D hubs across Europe, North America, and Israel. The team includes more than 400 engineers and an in-house AI R&D team.

Responsibilities

Own the site 24/7: deliver continuous availability across power, cooling, structured cabling, network, security, and DCIM—meeting or beating global SLAs.
Build and lead the team: hire, mentor, and develop managers/technicians; run staffing models, shift coverage, and on-call rotations that scale.
Be the incident commander: lead major events end-to-end—triage, communications, executive briefings, RCA, and durable corrective actions.
Drive reliability engineering: implement RCM, predictive maintenance, QA/QC, 5S, and Lean/continuous improvement to cut MTTR and raise MTBF.
Deliver capacity on time: plan and execute expansions/retrofits; commission MEP systems with Design/Construction; achieve flawless change control (MOP/SOP/EOP).
Scale tooling & automation: mature DCIM/BMS/EPMS, monitoring/alerting, work management (Jira/ServiceNow), knowledge base (Confluence), and light scripting/SQL for telemetry and workflow automation.
Run a metrics-first operation: publish dashboards and KPIs (availability, PUE, MTBF/MTTR, work compliance, safety) and use them to drive decisions.
Partner across functions: work with Cloud/Compute, Network, Security, and Capacity Planning to optimize performance, cost, and resiliency across the fleet.
Manage vendors & colos: own contracts, SLAs, and execution for rack deliveries, PDUs, fiber/copper, and lifecycle PMs; validate colo topology and compliance.
Raise the safety bar: enforce a zero-injury EHS culture; conduct drills/audits for life safety, physical security, and data protection.
Forecast and budget: build data-backed plans for power, spares, headcount, and projects; track OpEx/CapEx with rigor.
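
For a sense of the "light scripting" and metrics work mentioned above, the KPIs named in this list (availability, PUE, MTBF, MTTR) could be computed roughly as follows. This is a minimal sketch using the common textbook definitions, not Nebius tooling: the `Outage` record, the `site_kpis` function, and all sample figures are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One unplanned outage (hypothetical record shape)."""
    hours_down: float


def site_kpis(period_hours, outages, facility_kwh, it_kwh):
    """Compute basic site KPIs for one reporting period.

    Assumed definitions:
      MTTR         = total downtime / number of outages
      MTBF         = total uptime / number of outages
      availability = uptime / period length
      PUE          = total facility energy / IT equipment energy
    """
    downtime = sum(o.hours_down for o in outages)
    uptime = period_hours - downtime
    count = len(outages)
    return {
        # With no outages, MTTR is reported as 0 and MTBF as infinite.
        "mttr_hours": downtime / count if count else 0.0,
        "mtbf_hours": uptime / count if count else float("inf"),
        "availability": uptime / period_hours,
        "pue": facility_kwh / it_kwh,
    }


# Illustrative numbers only: a 30-day (720 h) period with two outages.
kpis = site_kpis(
    period_hours=720.0,
    outages=[Outage(0.5), Outage(1.0)],
    facility_kwh=130_000.0,
    it_kwh=100_000.0,
)
for name, value in kpis.items():
    print(f"{name}: {value:.4f}")
```

In practice the inputs would come from DCIM/BMS/EPMS telemetry and ticketing data rather than hard-coded values.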

Requirements

Associate's degree or trade certification in Electrical/Mechanical/Industrial Engineering (or equivalent experience).
10+ years in electrical/mechanical/HVAC/controls within industrial/commercial settings, including 5+ years specifically in data center or mission-critical facilities.
Team leadership experience in 24/7 sites (managing leads/techs, vendors, and on-call operations).
Deep, hands-on knowledge of UPS/generators/switchgear, chillers/CRAC/CRAH, fire detection/suppression, BMS/EPMS/DCIM, and structured cabling (copper & fiber).
Proven strength in incident management, RCA/Corrective Actions, change management, and vendor/contract oversight.
Data-driven mindset with the ability to forecast resources and make analytics-backed decisions (Excel; SQL/scripting a plus).
Excellent written/verbal communication with comfort presenting to executives and guiding field teams during live events.
Ability to travel up to ~25% and support after-hours escalations when needed.

Nice to have

Bachelor's degree in Electrical/Mechanical/Industrial Engineering, Engineering Management, or Reliability Engineering.
Hyperscale/colo experience with reliability-centered maintenance, predictive analytics, and Lean/Six Sigma practices.
Familiarity with Linux fundamentals, network equipment installation/troubleshooting, and fiber optics testing.
Experience with Jira, Confluence, ServiceNow (or similar); strong SOP/MOP/EOP authorship.
Certifications such as CDCP, DCM, PMP, OSHA-30, ITIL, or Uptime-aligned credentials.

Benefits

Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
401(k) plan: up to 4% company match with immediate vesting.
Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
Remote work reimbursement: up to $85/month for mobile and internet.
Disability & life insurance: company-paid short-term disability, long-term disability, and life insurance coverage.

Compensation

We offer competitive salaries, with base pay ranging from $90k to $140k, plus quarterly performance bonuses.