Principal High-Performance LLM Training Engineer

at NVIDIA

📍 Santa Clara, United States

USD 272,000-431,250 per year

SENIOR

✅ On-site

Used Tools & Technologies

Machine Learning, HPC

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2: basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6: daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9: you are an expert, you can teach others, you know all the pitfalls and tricks;

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Distributed Systems @ 4, Leadership @ 4, Communication @ 4, Networking @ 4, Performance Optimization @ 4, Product Management @ 4, Technical Leadership @ 4, LLM @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4, Deep Learning @ 4, AI @ 4, Reinforcement Learning @ 7, Profiling @ 7, Performance Analysis @ 4, JAX @ 4

Details

NVIDIA is seeking a Principal Engineer to drive the performance of large-scale AI training and post-training workloads across NVIDIA’s full hardware and software stack. This role sits at the intersection of distributed training, GPU architecture, systems software, deep learning frameworks, and performance engineering. You will analyze and optimize frontier-scale LLM workloads running on thousands of GPUs, drive improvements across frameworks such as PyTorch, JAX, NeMo, and NeMo RL, and use insights from real workloads to help shape future NVIDIA GPU, system, and software roadmaps.

Responsibilities

Lead end-to-end performance analysis and optimization of innovative LLM pre-training and post-training workloads on the latest NVIDIA hardware and software platforms.
Identify and remove bottlenecks across compute, memory, communication, scheduling, parallelism strategy, kernel efficiency, framework overhead, and system-level scaling to drive workloads closer to speed-of-light performance.
Develop production-quality software, tools, models, benchmarks, and analysis infrastructure that improve training performance, efficiency, and developer velocity across NVIDIA’s AI software stack.
Build and refine performance models, workload characterizations, and simulation methodologies to guide future GPU, networking, system, and software architecture decisions.
Serve as a technical authority for AI training performance, partnering closely with teams across GPU architecture, systems, CUDA libraries, compilers, networking, frameworks, product management, and applied AI.
Translate workload insights into concrete hardware and software recommendations, and advocate for changes that improve performance and efficiency across the AI ecosystem.
Mentor and provide technical leadership to engineers across the organization, helping establish best practices for large-scale AI performance analysis and optimization.

Requirements

MS or PhD (or equivalent experience) in Computer Science, Electrical Engineering, Computer Engineering, or a related field, with 12+ years of relevant work or research experience.
Demonstrated principal-level technical impact in one or more of: large-scale AI training systems, GPU performance optimization, distributed systems, high-performance computing, ML frameworks, compilers/runtimes, or hardware/software co-design.
Deep hands-on experience analyzing and optimizing performance of large-scale deep learning workloads, especially transformer-based models, LLM pre-training, reinforcement learning, fine-tuning, or other post-training workloads.
Strong understanding of GPU and AI accelerator architecture from individual accelerators to datacenter-scale systems.
Experience with distributed training techniques such as data parallelism, tensor parallelism, pipeline parallelism, expert parallelism, sequence parallelism, activation checkpointing, mixed precision training, and communication/computation overlap.
Strong track record of using profiling, tracing, benchmarking, and performance modeling tools to diagnose complex bottlenecks and drive measurable improvements.
Excellent communication and technical leadership skills, with the ability to influence architecture and software decisions across multiple teams without relying on direct authority.

Technologies & Systems Mentioned

PyTorch, JAX, NeMo, NeMo RL
CUDA libraries, compilers / runtimes
GPU architecture, memory systems, networking, communication collectives
Distributed training strategies and parallelism techniques
Profiling, tracing, benchmarking, and performance modeling tools
Transformer-based models, LLM pre-training, reinforcement learning, fine-tuning

Compensation & Other Details

Base salary range: 272,000 USD - 431,250 USD (determined by location, experience, and comparable pay).
Eligible for equity and benefits (link to NVIDIA benefits referenced in original posting).
Applications accepted at least until May 2, 2026.
NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.