Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Ansible @ 2
Ceph @ 6
Docker @ 3
Grafana @ 3
Kubernetes @ 3
Linux @ 3
Prometheus @ 3
Terraform @ 2
Python @ 6
CI/CD @ 3
MLOps @ 3
Bash @ 6
Communication @ 3
Helm @ 2
Networking @ 3
Kubeflow @ 3
MLflow @ 3
PyTorch @ 3
CUDA @ 5
Cloud Computing @ 3
GPU @ 3
Observability @ 3
AI @ 3
InfiniBand @ 3
Data Pipelines @ 3
NCCL @ 5
Slurm @ 3
Details
Nebius is leading a new era in cloud computing to serve the global AI economy. We create the tools and resources customers need to solve real-world AI/ML challenges at massive scale. Nebius is headquartered in Amsterdam, with R&D hubs across Europe, North America, and Israel, and a team of over 800 employees.
You are welcome to work remotely from the United States or Canada.
Role overview
We are seeking a Specialist HPC Infrastructure Solutions Architect to design, build, and optimize next-generation high-performance computing (HPC) platforms for AI, simulation, and large-scale data processing workloads. The role sits at the intersection of infrastructure engineering, accelerated computing, and AI systems design. It requires hands-on experience implementing NVIDIA GPU-based compute environments, Kubernetes orchestration, networking, and MLOps toolchains.
Responsibilities
- Architect and implement scalable HPC clusters optimized for AI, simulation, and distributed training, leveraging container orchestration frameworks and schedulers (e.g., Kubernetes, Slurm).
- Design and integrate GPU-accelerated compute infrastructures featuring NVIDIA Hopper and Blackwell architectures, NVLink/NVSwitch, and InfiniBand/RoCE interconnects.
- Deploy and manage GPU Operator and Network Operator stacks for automated lifecycle management of GPU and high-speed networking components.
- Design and validate cloud HPC environments focusing on low-latency, high-bandwidth networking, multi-GPU scaling, and efficient workload scheduling.
- Lead reference architectures for AI/ML model training, data pipelines, and MLOps integrations using modern observability and CI/CD tooling.
- Collaborate with hardware vendors (e.g., NVIDIA) and cloud providers to evaluate and optimize emerging HPC and GPU technologies.
- Benchmark system performance, identify bottlenecks, and tune resource utilization across compute, network, and storage tiers.
- Provide expert-level technical guidance to customers, internal teams, and partners through HPC architecture patterns, operational excellence reviews, and customer engagements.
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field (Ph.D. a plus).
- 3+ years of hands-on experience architecting HPC or large-scale GPU clusters.
- Expertise in Linux systems, Kubernetes, container runtimes (CRI-O, Docker), and related CI/CD practices.
- Strong understanding of HPC networking protocols and RDMA stacks (InfiniBand, NVLink/NVSwitch, RoCE).
- Deep understanding of storage and I/O optimization for large datasets (Ceph, Lustre, NFS, GPUDirect Storage).
- Familiarity with Terraform, Ansible, Helm, and GitOps workflows.
- Strong scripting skills in Python or Bash for automation and tool integration.
- Excellent communication and documentation skills; ability to lead design reviews and customer engagements.
Nice to have / Added bonus
- Proficiency with the NVIDIA GPU ecosystem: GPU Operator, MIG, DCGM, NCCL, Nsight, and CUDA stack management.
- Experience designing or managing AI/ML pipelines via MLflow, Kubeflow, NeMo, or similar frameworks.
- Experience with cloud-native HPC offerings and schedulers (Slurm, LSF, PBS, etc.).
- Background in designing multi-tenant GPU infrastructures or AI training farms.
- Exposure to distributed ML frameworks (PyTorch DDP, DeepSpeed, Megatron).
- Knowledge of observability for HPC (Prometheus, DCGM Exporter, Grafana, NVIDIA NGC monitoring tools).
- Contributions to open-source HPC/CUDA/Kubernetes projects are a strong plus.
Compensation
We offer competitive salaries ranging from 225k to 315k OTE (On-Target Earnings), plus equity, based on experience, skills, and location.
Benefits
- Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
- 401(k) plan: Up to 4% company match with immediate vesting.
- Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
- Remote work reimbursement: Up to $85/month for mobile and internet.
- Disability & life insurance: Company-paid short-term, long-term, and life insurance coverage.
- Flexible working arrangements and opportunities for professional growth within Nebius.