Principal ML Solutions Architect - Token Factory

at Nebius
USD 208,000-261,000 per year
SENIOR
Remote

Used Tools & Technologies

Machine Learning GenAI

Required Skills & Competences

Docker @ 6, Kubernetes @ 6, DevOps @ 6, Python @ 7, GCP @ 4, MLOps @ 4, Vertex AI @ 4, Hiring @ 4, Leadership @ 6, AWS @ 4, Azure @ 4, Communication @ 4, Git @ 4, Mentoring @ 4, Debugging @ 4, API @ 4, Technical Leadership @ 6, OSS @ 4, LLM @ 4, GPU @ 4, Generative AI @ 7, AI @ 4, vLLM @ 4, RAG @ 4, TensorRT @ 4, SGLang @ 4, Prompt Engineering @ 4

Details

About Nebius

Nebius is building a full-stack AI cloud platform that supports developers and enterprises from data and model training through to production deployment. The company focuses on large-scale GPU orchestration, inference optimization, and applied AI. Listed on Nasdaq (NBIS) and headquartered in Amsterdam, Nebius has R&D hubs across Europe, the UK, North America and Israel and a global team of 1,500+ engineers.

Role overview

This role sits within Nebius Token Factory, a serverless platform for running and customizing open-source LLMs in production. Token Factory supports serverless inference and fine-tuning (LoRA, FT, RFT) and uses in-house optimizations such as custom speculative decoding, quantization, cache-aware routing and dedicated endpoints. The Principal ML Solutions Architect will be the most senior technical authority for customers using Token Factory. The role involves owning complex customer engagements, solving performance and quality problems end-to-end, mentoring other Solutions Architects, and shaping the platform roadmap with product, backend, and research teams. You are welcome to work remotely from the United States.

Responsibilities

  • Own the most complex, highest-stakes customer engagements from architecture through production across multiple modalities, driving measurable business value.
  • Optimize LLM inference at the framework and hardware level and codify best practices into reusable playbooks for the team.
  • Lead supervised and reinforcement fine-tuning efforts to maximize model quality.
  • Design and implement production-ready LLM solutions using Token Factory's inference services.
  • Provide deep technical expertise in prompt engineering, RAG architectures, model selection, and cost/performance trade-offs at scale.
  • Partner closely with product, engineering and research to surface customer needs, prototype platform features, and influence the roadmap.
  • Guide customers from PoC to production with a focus on performance, reliability, and cost efficiency, and define team standards.
  • Mentor Senior and mid-level Solutions Architects; raise the technical bar through review, enablement, and knowledge sharing.
  • Represent Token Factory externally through talks, blog posts, and conferences.

Requirements

  • 8+ years of experience in ML/AI systems, with at least 4 years focused on LLMs and generative AI.
  • Demonstrated technical leadership: owning ambiguous, high-impact problems end to end and influencing decisions across teams and customers.
  • Expert knowledge of the LLM ecosystem: model architectures, fine-tuning approaches, and inference internals.
  • Deep, hands-on command of inference optimization: quantization, KV-cache management, batching, routing, etc.
  • Hands-on experience running LLMs in production at scale: deploying, operating, and debugging inference workloads down to the framework level.
  • Hands-on LLM fine-tuning, including SFT/LoRA and data preparation/curation; experience with RL-based fine-tuning.
  • LLM evaluation experience: building task-specific benchmarks and offline/online eval pipelines, including LLM-as-a-judge setups.
  • Experience with inference frameworks and libraries (vLLM, SGLang, TensorRT-LLM), including the ability to read and modify their internals.
  • Experience deploying LLM-powered applications using APIs from OpenAI, Anthropic, or open-source models.
  • Strong Python programming skills.
  • Excellent communication skills, with the ability to explain technical concepts to diverse audiences (engineers to executives).

It would be a bonus if you have

  • Contributions or maintainership in major OSS inference/ML projects (vLLM, SGLang, TensorRT-LLM).
  • Published research, conference talks, or widely-read technical writing in the LLM/serving space.
  • Deep work with multimodal AI models (vision-language, speech).
  • Proficiency with DevOps tooling (Docker, Kubernetes) and infrastructure-as-code.
  • Experience building or owning internal tooling/automation for ML workflows at scale.

Preferred technical stack

  • Programming languages: Python
  • ML frameworks and libraries: vLLM, TensorRT-LLM, SGLang, Transformers, OpenAI/Anthropic SDKs
  • MLOps and DevOps tools: Kubernetes, Docker, Git
  • Cloud platforms: AWS (SageMaker, Bedrock), GCP (Vertex AI), Azure (Azure ML)

Compensation

Base compensation range: $208,000 to $261,000 USD (actual compensation determined by experience, skills, qualifications, hiring level, and geographic location).

Key employee benefits

  • 100% company-paid medical, dental, and vision coverage for employees and families.
  • 401(k) plan with up to 4% company match and immediate vesting.
  • Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
  • Remote work reimbursement: up to $85/month for mobile and internet.
  • Company-paid short-term and long-term disability insurance, and life insurance.
  • Career growth, flexibility, collaborative culture, and opportunity to work on impactful AI projects.

Equal opportunity & eligibility

Nebius is an equal opportunity employer committed to an inclusive workplace. Applicants must be authorized to work in the country in which they apply and will be required to provide proof of employment eligibility as a condition of hire. Applicants who need accommodations during the application process are asked to inform the company.