Senior Data Center Deployment Engineer

at Nebius

📍 United States

USD 125,000-180,000 per year

SENIOR

✅ Remote

Used Tools & Technologies

Machine Learning

Required Skills & Competences
Tag name is followed by "@" symbol and proficiency level value. About proficiency levels:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.

Linux @ 4, Leadership @ 4, Communication @ 4, Networking @ 4, Technical Leadership @ 4, Cloud Computing @ 4, GPU @ 4, AI @ 4, HPC @ 4

Details

Nebius is leading a new era in cloud computing to serve the global AI economy. We create the tools and resources our customers need to solve real-world challenges and transform industries, without massive infrastructure costs or the need to build large in-house AI/ML teams. Our employees work at the cutting edge of AI cloud infrastructure alongside experienced and innovative leaders and engineers.

Where we work

Headquartered in Amsterdam and listed on Nasdaq, Nebius has a global footprint with R&D hubs across Europe, North America, and Israel. The team includes over 800 employees, more than 400 of them engineers across hardware and software, along with an in-house AI R&D team.

Role description

Nebius operates large-scale, GPU-dense AI infrastructure across mission-critical data center environments. As a Senior Data Center Deployment Engineer, you will own the end-to-end delivery, deployment, and production readiness of next-generation GPU platforms inside our data centers. This role sits at the intersection of hardware, Linux systems, and operational execution. You will lead on-site rack bring-up, validate NVIDIA-based AI systems, coordinate repairs, and ensure GB-series infrastructure moves from installation to fully operational production environments. You will collaborate closely with hardware engineering, networking, and infrastructure teams to deploy and stabilize H200- and B200-based GPU systems at scale.

Responsibilities

Lead end-to-end deployment of GB-series racks within data center environments
Oversee installation, bring-up, validation, and production readiness of NVIDIA H200- and B200-based servers
Troubleshoot complex hardware, firmware, Linux OS, and networking issues
Execute structured testing and validation procedures during deployment
Develop and maintain basic Linux-based hardware health-check and diagnostic scripts
Coordinate on-site hardware repairs, part replacements, and vendor escalations
Drive root cause analysis and ensure corrective actions are implemented
Manage and prioritize deployment timelines across multiple concurrent rollouts
Provide technical leadership and guidance to on-site engineers and technicians
Partner with networking and infrastructure teams to ensure seamless integration
Document deployment processes, validation standards, and operational runbooks

Requirements

Strong hands-on experience deploying and operating data center infrastructure
Deep familiarity with GPU-dense systems, ideally NVIDIA H-series platforms
Experience working with high-density rack deployments (GB-series or similar)
Solid Linux experience, including troubleshooting and scripting
Ability to diagnose issues across hardware, OS, firmware, and network layers
Experience coordinating field repairs and working directly with hardware vendors
Proven experience leading technical teams or overseeing field operations
High ownership mindset and ability to operate in production-critical environments
Clear communication skills and ability to collaborate across distributed teams

Nice to have

Experience deploying AI or HPC clusters at scale
Familiarity with automated provisioning or infrastructure lifecycle systems
Background in hardware qualification, burn-in testing, or factory validation
Experience supporting rapid infrastructure expansion
Exposure to ARM-based or heterogeneous compute environments

Working conditions

Fully remote position (United States)
Collaboration with globally distributed engineering and operations teams

Benefits

Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families
401(k) plan: up to 4% company match with immediate vesting
Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers
Remote work reimbursement: up to $85/month for mobile and internet
Disability & life insurance: company-paid short-term, long-term, and life insurance coverage
Competitive salary and comprehensive benefits package; opportunities for professional growth and flexible working arrangements

Compensation

We offer competitive salaries ranging from $125k to $180k base, plus quarterly performance bonuses.