Deep Learning Solutions Architect – Distributed Training

at NVIDIA
GBP 70,000-140,000 per year
Middle / Senior
✅ Remote


Used Tools & Technologies

NeMo, Megatron-LM, DeepSpeed, PyTorch FSDP, TensorRT-LLM, vLLM, Triton Inference Server, Python, C++

Required Skills & Competences

Marketing (3), Software Development (5), Python (5), Data Science (3), Communication (3), Mathematics (3), Performance Optimization (3), HTTP (3), NLP (3), LLM (1), PyTorch (6), GPU (3)

Details

NVIDIA's Worldwide Field Operations (WWFO) team is seeking a Solution Architect with a strong focus on Deep Learning and a deep understanding of neural network training. The introduction of NVIDIA GB200 NVL72 systems, featuring chip-to-chip NVLink and an extended NVLink domain, has enabled new neural network architectures and training approaches. The ideal candidate will be proficient with tools such as NeMo, Megatron-LM, DeepSpeed, PyTorch FSDP, or similar, and will have strong systems knowledge to help customers maximize the potential of the new Grace Blackwell training systems. Experience with LLM post-training, especially Reinforcement Learning (RL), is a strong plus.
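
For context, a minimal sketch of the kind of distributed training setup this role supports, using PyTorch FSDP (one of the tools named above). This is an illustrative assumption, not part of the posting; the model, hyperparameters, and launch command are placeholders.

    # Minimal FSDP sketch; launch with: torchrun --nproc_per_node=<num_gpus> train.py
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def main():
        dist.init_process_group(backend="nccl")  # one process per GPU
        torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

        # Toy transformer layer standing in for a real LLM; sizes are illustrative.
        model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda()
        model = FSDP(model)  # shards parameters, gradients, and optimizer state across ranks
        optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

        # Dummy batch and objective, just to show the shape of a training step.
        x = torch.randn(128, 8, 1024, device="cuda")
        loss = model(x).pow(2).mean()
        loss.backward()
        optim.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()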

Responsibilities

  • Collaborate directly with key customers to understand their technology stack and provide expert guidance on AI solutions, training tools, and methodology.
  • Perform detailed analysis and optimization to ensure maximum performance on GPU systems, particularly Grace/Arm-based systems, including optimization of distributed training pipelines.
  • Work with Engineering, Product, and Sales teams to develop and plan the best solutions for customers; support product feature growth through customer feedback and proof-of-concept evaluations.

Requirements

  • Excellent verbal and written communication and technical presentation skills in English.
  • MS/PhD or equivalent experience in Computer Science, Data Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields.
  • 5+ years of work or research experience in software development with Python, C++, or similar languages.
  • Knowledge of and hands-on experience with modern NLP, including transformer, state-space, diffusion, and MoE (mixture-of-experts) model architectures, whether in training or in optimization/compression/operation of DNNs.
  • Familiarity with key libraries used in NLP/LLM training (Megatron-LM, NeMo, DeepSpeed) and/or deployment (TensorRT-LLM, vLLM, Triton Inference Server); an illustrative deployment sketch follows this list.
  • Proven track record in neural network performance optimization and/or training robustness.
  • Comfortable working across multiple teams and organizational levels (Engineering, Product, Sales, Marketing) in a fast-evolving environment.
  • Self-starter with a growth mindset, a passion for continuous learning, and a commitment to sharing knowledge within teams.
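
As an illustration of the deployment side referenced above, a minimal offline-inference sketch with vLLM; the model name and sampling settings are assumptions, not part of the posting.

    from vllm import LLM, SamplingParams

    prompts = ["Explain tensor parallelism in one sentence."]
    params = SamplingParams(temperature=0.7, max_tokens=64)

    # tensor_parallel_size shards the model across GPUs on a single node.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

    for out in llm.generate(prompts, params):
        print(out.outputs[0].text)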

Ways to Stand Out

  • Ability to conduct LLM post-training, notably large-scale RL.
  • Experience running large-scale training/HPC jobs with a focus on training robustness and failure resilience.
  • Familiarity with HPC systems, including data center design, high-speed interconnects like InfiniBand, cluster storage, and scheduling design or management.

Benefits

NVIDIA offers highly competitive salaries and a comprehensive benefits package. For more information, visit www.nvidiabenefits.com.

NVIDIA is an equal opportunity employer committed to diversity and inclusion in its workforce.