Deep Learning Solutions Architect – Distributed Training
Required Skills & Competences
- Marketing (3)
- Software Development (5)
- Python (5)
- Data Science (3)
- Communication (3)
- Mathematics (3)
- Performance Optimization (3)
- HTTP (3)
- NLP (3)
- LLM (1)
- PyTorch (6)
- GPU (3)
Details
NVIDIA's Worldwide Field Operations (WWFO) team is seeking a Solution Architect with a strong focus on Deep Learning and a deep understanding of neural network training. The introduction of NVIDIA GB200 NVL72 systems, featuring Chip-to-Chip NVLINK and an extended NVLINK domain, has enabled new neural network architectures and training approaches. The ideal candidate will be proficient with tools such as NeMo, Megatron-LM, DeepSpeed, PyTorch FSDP or similar, and have strong system knowledge to help customers maximize the potential of the new Grace Blackwell training systems. Experience with LLM post-training, especially Reinforcement Learning (RL), is a strong plus.
Responsibilities
- Collaborate directly with key customers to understand their technology and provide expert AI solutions and training guidance regarding tools and methodology.
- Perform detailed analysis and optimization to ensure maximum performance on GPU architecture systems, particularly Grace/ARM-based systems, including distributed training pipeline optimization.
- Work with Engineering, Product, and Sales teams to develop and plan the best solutions for customers; support product feature growth through customer feedback and proof-of-concept evaluations.
Requirements
- Excellent verbal and written communication and technical presentation skills in English.
- MS/PhD or equivalent experience in Computer Science, Data Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields.
- 5+ years of work or research experience with Python, C++, or other software development.
- Knowledge of and work experience with modern NLP, including an understanding of transformer, state-space, diffusion, and MoE model architectures, whether in training or in the optimization/compression/operation of DNNs.
- Familiarity with key libraries used in NLP/LLM training (Megatron-LM, NeMo, DeepSpeed) and/or deployment (TensorRT-LLM, vLLM, Triton Inference Server).
- Proven track record in neural network performance optimization and/or training robustness.
- Comfortable working across multiple teams and organizational levels (Engineering, Product, Sales, Marketing) in a fast-evolving environment.
- Self-starter with growth mindset, passion for continuous learning, and sharing knowledge within teams.
Ways to Stand Out
- Ability to conduct LLM post-training, notably large-scale RL.
- Experience running large-scale training/HPC jobs with a focus on training robustness and failure resilience.
- Familiarity with HPC systems, including data center design, high-speed interconnects like InfiniBand, cluster storage, and scheduling design or management.
Benefits
NVIDIA offers highly competitive salaries and a comprehensive benefits package. For more information, visit www.nvidiabenefits.com.
NVIDIA is an equal opportunity employer committed to diversity and inclusion in its workforce.