HPC Specialist Solutions Architect

at Nebius
📍 Canada
📍 United States
USD 225,000-315,000 per year
MIDDLE
✅ Remote

Used Tools & Technologies

Machine Learning

Required Skills & Competences

Ansible @ 2, Ceph @ 6, Docker @ 3, Grafana @ 3, Kubernetes @ 3, Linux @ 3, Prometheus @ 3, Terraform @ 2, Python @ 6, CI/CD @ 3, MLOps @ 3, Bash @ 6, Communication @ 3, Helm @ 2, Networking @ 3, KubeFlow @ 3, MLFlow @ 3, PyTorch @ 3, CUDA @ 5, Cloud Computing @ 3, GPU @ 3, Observability @ 3, AI @ 3, InfiniBand @ 3, Data Pipelines @ 3, NCCL @ 5, Slurm @ 3

Details

Nebius is leading a new era in cloud computing to serve the global AI economy. We create the tools and resources customers need to solve real-world AI/ML challenges at massive scale. Nebius is headquartered in Amsterdam, with R&D hubs across Europe, North America, and Israel, and has a team of over 800 employees.

You are welcome to work remotely from the United States or Canada.

Role overview

We are seeking a Specialist HPC Infrastructure Solutions Architect to design, build, and optimize next-generation high-performance computing (HPC) platforms for AI, simulation, and large-scale data processing workloads. The role sits at the intersection of infrastructure engineering, accelerated computing, and AI systems design and requires hands-on experience implementing NVIDIA GPU-based compute environments, Kubernetes orchestration, networking, and MLOps toolchains.

Responsibilities

  • Architect and implement scalable HPC clusters optimized for AI, simulation, and distributed training, leveraging container orchestration frameworks and schedulers (e.g., Kubernetes, Slurm).
  • Design and integrate GPU-accelerated compute infrastructures featuring NVIDIA Hopper and Blackwell architectures, NVLink/NVSwitch, and InfiniBand/RoCE interconnects.
  • Deploy and manage GPU Operator and Network Operator stacks for automated lifecycle management of GPU and high-speed networking components.
  • Design and validate cloud HPC environments focusing on low-latency, high-bandwidth networking, multi-GPU scaling, and efficient workload scheduling.
  • Lead reference architectures for AI/ML model training, data pipelines, and MLOps integrations using modern observability and CI/CD tooling.
  • Collaborate with hardware vendors (e.g., NVIDIA) and cloud providers to evaluate and optimize emerging HPC and GPU technologies.
  • Benchmark system performance, identify bottlenecks, and tune resource utilization across compute, network, and storage tiers.
  • Provide expert-level technical guidance to customers, internal teams, and partners on HPC architecture patterns and operational excellence, including through architecture reviews and customer engagements.

Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field (Ph.D. a plus).
  • 3+ years of hands-on experience architecting HPC or large-scale GPU clusters.
  • Expertise in Linux systems, Kubernetes, container runtimes (CRI-O, Docker), and related CI/CD practices.
  • Strong understanding of HPC networking protocols and RDMA stacks (InfiniBand, NVLink/NVSwitch, RoCE).
  • Deep understanding of storage and I/O optimization for large datasets (Ceph, Lustre, NFS, GPUDirect Storage).
  • Familiarity with Terraform, Ansible, Helm, and GitOps workflows.
  • Strong scripting skills in Python or Bash for automation and tool integration.
  • Excellent communication and documentation skills; ability to lead design reviews and customer engagements.

Nice to have / Added bonus

  • Proficiency with the NVIDIA GPU ecosystem: GPU Operator, MIG, DCGM, NCCL, Nsight, and CUDA stack management.
  • Experience designing or managing AI/ML pipelines via MLflow, Kubeflow, NeMo, or similar frameworks.
  • Experience with cloud-native HPC offerings and schedulers (Slurm, LSF, PBS, etc.).
  • Background in designing multi-tenant GPU infrastructures or AI training farms.
  • Exposure to distributed ML frameworks (PyTorch DDP, DeepSpeed, Megatron).
  • Knowledge of observability for HPC (Prometheus, DCGM Exporter, Grafana, NVIDIA NGC monitoring tools).
  • Contributions to open-source HPC/CUDA/Kubernetes projects are a strong plus.

Compensation

We offer competitive salaries ranging from USD 225k to 315k OTE (On-Target Earnings), plus equity, based on experience, skills, and location.

Benefits

  • Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
  • 401(k) plan: Up to 4% company match with immediate vesting.
  • Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
  • Remote work reimbursement: Up to $85/month for mobile and internet.
  • Disability & life insurance: Company-paid short-term, long-term, and life insurance coverage.
  • Flexible working arrangements and opportunities for professional growth within Nebius.