AI/ML Specialist Solutions Architect

at Nebius
📍 Canada
📍 United States
USD 225,000-315,000 per year
MIDDLE SENIOR
✅ Remote

Required Skills & Competences

Docker @ 3, Go @ 3, Kubernetes @ 3, DevOps @ 3, Terraform @ 3, Python @ 3, Java @ 3, Machine Learning @ 7, MLOps @ 7, scikit-learn @ 3, TensorFlow @ 3, Communication @ 3, Git @ 3, Helm @ 3, PyTorch @ 3, Cloud Computing @ 3, GPU @ 5, AI @ 3, Slurm @ 3

Details

Nebius is leading a new era in cloud computing to serve the global AI economy. The company builds tools and resources to help customers solve real-world AI/ML challenges at scale without massive infrastructure costs or the need for large in-house teams. Nebius is headquartered in Amsterdam, listed on Nasdaq, and has R&D hubs across Europe, North America, and Israel.

This role sits in Customer Experience and supports AI-focused customers who use Nebius services. You will act as a trusted advisor, work with clients to design scalable AI solutions, resolve technical challenges, and manage large-scale AI deployments involving hundreds to thousands of GPUs. The position is remote-friendly, and candidates may work from the United States or Canada.

Responsibilities

  • Design customer-centric solutions that maximize business value and align with strategic goals.
  • Build and maintain long-term customer relationships to foster trust and ensure satisfaction.
  • Deliver technical presentations, produce whitepapers, create manuals, and host webinars for audiences with varying technical expertise.
  • Collaborate with engineering and product teams to prioritize and relay customer feedback.
  • Support customers in optimizing AI solutions at massive GPU cloud scale and influence the development of Nebius AI Cloud.

Requirements

  • 7–10+ years of experience with cloud technologies in MLOps engineering, Machine Learning engineering, or similar roles.
  • Strong understanding of ML ecosystems, including models, use cases, and tooling.
  • Proven experience in setting up and optimizing distributed training pipelines across multi-node and multi-GPU environments.
  • Hands-on knowledge of frameworks such as PyTorch or JAX.
  • Excellent verbal and written communication skills.

It is an added bonus if you have:

  • Expertise in deploying inference infrastructure for production workloads.
  • Ability to transition ML pipelines from proof-of-concept to scalable production systems.

Preferred tooling and technologies:

  • Programming languages: Python, Go, Java, C++
  • Orchestration: Kubernetes (K8s), Slurm
  • DevOps tools: Git, Docker, Helm
  • Infrastructure as Code: Terraform
  • ML frameworks & libraries: PyTorch, TensorFlow, JAX, HuggingFace, Scikit-learn
  • GPU hardware: H200, B200, GB200

Benefits

  • Competitive salary and comprehensive benefits package.
  • Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
  • 401(k) plan: Up to 4% company match with immediate vesting.
  • Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
  • Remote work reimbursement: Up to $85/month for mobile and internet.
  • Disability & life insurance: Company-paid short-term disability, long-term disability, and life insurance coverage.
  • Opportunities for professional growth, flexible working arrangements, and equity potential.

Compensation

  • Salary range: USD 225,000 to 315,000 OTE (on-target earnings). Equity may be offered based on experience, skills, and location.