GPU Cluster Architect

at Nebius

📍 United States

USD 150,000-180,000 per year

MIDDLE

✅ Remote ✅ Hybrid

Used Tools & Technologies

Machine Learning

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The proficiency levels are:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Go @ 3 Python @ 3 Networking @ 3 LLM @ 3 Cloud Computing @ 3 GPU @ 3 AI @ 3 InfiniBand @ 3

Details

Nebius is leading a new era in cloud computing to serve the global AI economy. We create tools and resources to help customers solve real-world challenges and transform industries without massive infrastructure costs or large in-house AI/ML teams. Nebius is headquartered in Amsterdam with R&D hubs across Europe, North America, and Israel.

Role overview

We are seeking a GPU Cluster Architect to drive the design of our next-generation AI infrastructure. In this high-impact, hands-on role you will make end-to-end architectural decisions across compute, networking, and storage — ensuring platforms meet the scale, performance, and reliability requirements of modern AI workloads. You will define how tens of thousands of GPUs are interconnected, cooled, powered, and optimized across multiple data center sites.

You are welcome to work remotely from the United States.

Responsibilities

Architect scalable GPU cluster topologies including compute nodes, interconnect (InfiniBand, Ethernet), storage, and control planes.
Analyze AI/ML workloads (e.g., LLM training, inference) to inform design tradeoffs across latency, bandwidth, and GPU density (performance modeling).
Work with network architects to design and validate low-latency, high-throughput interconnects (e.g., InfiniBand HDR/NDR, RoCEv2) at POD and data center scale.
Work with storage teams to optimize performance for training datasets, checkpointing, and related workflows.
Understand and analyze signals from monitoring systems to detect flows and inform design decisions (reliability & monitoring).
Partner with site reliability, networking, storage, and data center engineering teams to operationalize and scale the architecture.

Requirements

5+ years of experience designing clusters.
Deep understanding of modern GPU architecture (NVIDIA, AMD, etc.).
Experience with HPC interconnects (InfiniBand & RoCE).
Solid background in systems architecture, networking, and hardware reliability.
Experience in scripting for automation and telemetry pipelines (Python, Go, etc.).

Compensation

Base salary range: $150,000 - $180,000 per year, plus quarterly performance bonuses.

Benefits (Key US benefits)

Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
401(k) plan: Up to 4% company match with immediate vesting.
Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
Remote work reimbursement: Up to $85/month for mobile and internet.
Disability & life insurance: Company-paid short-term, long-term, and life insurance coverage.
Competitive salary and comprehensive benefits package; opportunities for professional growth; hybrid/flexible working arrangements.

Other

Department: Hardware Infrastructure
Headquarters: Amsterdam (company); the role is available remotely from the United States