Principal Software Engineer, E2E Performance and Goodput — CSP Engagements
at NVIDIA
USD 272,000-431,250 per year
Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Python @ 7
Leadership @ 4
Communication @ 4
Data Analysis @ 7
LLM @ 4
Pandas @ 7
CUDA @ 4
GPU @ 4
AI @ 4
Profiling @ 4
vLLM @ 4
NCCL @ 4
TensorRT @ 4
SGLang @ 4
HPC @ 8
Performance Analysis @ 4
NVLink @ 3
Details
We are looking for a Principal Engineer to join the CSP Engagements team as the technical focal point for end-to-end performance. You will work directly with engineering teams of key cloud service provider (CSP) / hyperscale customers to ensure they achieve performance targets on NVIDIA platforms. You will augment NVIDIA's performance and benchmark teams with a dedicated CSP-facing focus, drive work streams with CSP engineering teams, gather workload-specific feedback to influence NVIDIA optimization priorities, and validate performance targets in customer-representative configurations. Your cross-CSP visibility will help identify patterns and drive systemic improvements in documentation, configuration guidance, and tooling.
Responsibilities
- Drive performance characterization work streams with engineering teams of key CSP/hyperscale customers — ensure they understand platform performance expectations, profiling methodology, and tuning options for their workloads.
- Gather and synthesize CSP performance feedback — identify gaps between expected and actual throughput and champion optimization priorities back into NVIDIA's CUDA, NCCL, driver, and firmware teams.
- Ensure key open-source performance and stress tools (e.g., STREAM, GPU Burn, GPU BLAST) are updated and validated for the latest NVIDIA rack-scale systems, GPU architectures, and CPU platforms so customers and internal teams have reliable baseline measurements.
- Work closely with CSPs to ensure their performance and validation tooling reflects the latest GPU capabilities, memory hierarchy changes, and platform-specific tuning parameters.
- Conduct cross-CSP performance comparison and pattern analysis — identify configuration, software, or workload differences that explain performance gaps between deployments.
- Collaborate with CSPs to ensure performance-related integration work (profiling infrastructure, benchmark harnesses, config validation) is ready ahead of deployment milestones.
- Define test strategies and tooling requirements for performance validation, covering both NVIDIA internal certification and customer acceptance.
Requirements
- 15+ years of experience in systems performance engineering, ideally in GPU/HPC/ML infrastructure. BS or MS in Computer Science, Computer Engineering, or related field (or equivalent experience).
- Proficiency in GPU workload profiling: Nsight Systems, Nsight Compute, DCGM metrics, or equivalent instrumentation.
- Understanding of distributed training performance dynamics: computation/communication overlap, pipeline bubbles, memory bandwidth utilization, collective efficiency.
- Knowledge of how the full software stack impacts performance: driver overhead, collective algorithm selection, memory allocation, scheduling, firmware power management.
- Statistical methods for performance analysis: regression detection, confidence intervals, A/B comparison at scale.
- Strong data analysis and visualization skills (Python, pandas, dashboards).
- Ability to communicate performance findings to both deep technical audiences and executive leadership.
- Demonstrated success influencing multiple engineering teams to prioritize performance improvements.
Ways to stand out from the crowd
- Experience profiling and optimizing distributed training at 1000+ GPU scale (Megatron-LM, DeepSpeed, FSDP).
- Background in ML infrastructure performance at a CSP/hyperscaler.
- Familiarity with NVIDIA platforms (DGX, HGX, NVLink topology) and profiling tools.
- Experience building automated performance regression detection systems for production environments.
- Understanding of inference workload performance dynamics (vLLM, TensorRT-LLM, SGLang, continuous batching).
Compensation & Benefits
- Base salary range: 272,000 USD - 431,250 USD.
- Eligible for equity and benefits (link to NVIDIA benefits referenced in the posting).
Additional information
- Applications for this job will be accepted at least until June 30, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer and committed to fostering an inclusive work environment.