Deep Learning Solutions Architect – Distributed Training
Required Skills & Competences
- Marketing (3)
- Software Development (5)
- Python (5)
- Data Science (3)
- Communication (3)
- Mathematics (3)
- Performance Optimization (3)
- HTTP (3)
- NLP (3)
- LLM (1)
- PyTorch (6)
- GPU (3)
Details
NVIDIA's Worldwide Field Operations (WWFO) team is seeking a Solution Architect with a strong focus on Deep Learning and a deep understanding of neural network training. The introduction of NVIDIA GB200 NVL72 systems, featuring Chip-to-Chip NVLINK and an extended NVLINK domain, has enabled new neural network architectures and training approaches. The ideal candidate will be proficient with tools such as NeMo, Megatron-LM, DeepSpeed, PyTorch FSDP or similar, and have strong system knowledge to help customers maximize the potential of the new Grace Blackwell training systems. Experience with LLM post-training, especially Reinforcement Learning (RL), is a strong plus.
Responsibilities
- Collaborate directly with key customers to understand their technology and provide expert AI solutions and training guidance regarding tools and methodology.
- Perform detailed analysis and optimization to ensure maximum performance on GPU architecture systems, particularly Grace/ARM-based systems, including distributed training pipeline optimization.
- Work with Engineering, Product, and Sales teams to develop and plan the best solutions for customers; support product feature growth through customer feedback and proof-of-concept evaluations.
Requirements
- Excellent verbal and written communication and technical presentation skills in English.
- MS/PhD or equivalent experience in Computer Science, Data Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields.
- 5+ years of work or research experience with Python, C++, or other software development.
- Knowledge of and work experience with modern NLP, including an understanding of transformer, state-space, diffusion, and MoE model architectures, whether in training or in the optimization/compression/operation of DNNs.
- Familiarity with key libraries used in NLP/LLM training (Megatron-LM, NeMo, DeepSpeed) and/or deployment (TensorRT-LLM, vLLM, Triton Inference Server).
- Proven track record in neural network performance optimization and/or training robustness.
- Comfortable working across multiple teams and organizational levels (Engineering, Product, Sales, Marketing) in a fast-evolving environment.
- Self-starter with growth mindset, passion for continuous learning, and sharing knowledge within teams.
Ways to Stand Out
- Ability to conduct LLM post-training, notably large-scale RL.
- Experience running large-scale training/HPC jobs with a focus on training robustness and failure resilience.
- Familiarity with HPC systems, including data center design, high-speed interconnects like InfiniBand, cluster storage, and scheduling design or management.
Benefits
NVIDIA offers highly competitive salaries and a comprehensive benefits package. For more information, visit www.nvidiabenefits.com.
NVIDIA is an equal opportunity employer committed to diversity and inclusion in its workforce.