Principal Software Engineer, E2E Performance and Goodput — CSP Engagements

at NVIDIA

📍 Santa Clara, United States

USD 272,000-431,200 per year

SENIOR

✅ On-site

Used Tools & Technologies

Machine Learning

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;

3-6: daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9: expert. You can teach others and you know all the pitfalls and tricks;

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Python @ 7, Leadership @ 4, Communication @ 4, Data Analysis @ 7, LLM @ 4, Pandas @ 7, CUDA @ 4, GPU @ 4, AI @ 4, Profiling @ 4, vLLM @ 4, NCCL @ 4, TensorRT @ 4, SGLang @ 4, HPC @ 8, Performance Analysis @ 4, NVLink @ 3

Details

We are looking for a Principal Engineer to join the CSP Engagements team as the technical focal point for end-to-end performance. You will work directly with engineering teams of key cloud service provider (CSP) / hyperscale customers to ensure they achieve performance targets on NVIDIA platforms. You will augment NVIDIA's performance and benchmark teams with a dedicated CSP-facing focus, drive work streams with CSP engineering teams, gather workload-specific feedback to influence NVIDIA optimization priorities, and validate performance targets in customer-representative configurations. Your cross-CSP visibility will help identify patterns and drive systemic improvements in documentation, configuration guidance, and tooling.

Responsibilities

Drive performance characterization work streams with engineering teams of key CSP/hyperscale customers — ensure they understand platform performance expectations, profiling methodology, and tuning options for their workloads.
Gather and synthesize CSP performance feedback — identify gaps between expected and actual throughput and champion optimization priorities back into NVIDIA's CUDA, NCCL, driver, and firmware teams.
Ensure key open-source performance and stress tools (e.g., STREAM, GPU Burn, GPU BLAST) are updated and validated for the latest NVIDIA rack-scale systems, GPU architectures, and CPU platforms so customers and internal teams have reliable baseline measurements.
Work closely with CSPs to ensure their performance and validation tooling reflects the latest GPU capabilities, memory hierarchy changes, and platform-specific tuning parameters.
Conduct cross-CSP performance comparison and pattern analysis — identify configuration, software, or workload differences that explain performance gaps between deployments.
Collaborate with CSPs to ensure performance-related integration work (profiling infrastructure, benchmark harnesses, config validation) is ready ahead of deployment milestones.
Define test strategies and tooling requirements for performance validation, covering both NVIDIA internal certification and customer acceptance.

Requirements

15+ years of experience in systems performance engineering, ideally in GPU/HPC/ML infrastructure. BS or MS in Computer Science, Computer Engineering, or related field (or equivalent experience).
Proficiency in GPU workload profiling: Nsight Systems, Nsight Compute, DCGM metrics, or equivalent instrumentation.
Understanding of distributed training performance dynamics: computation/communication overlap, pipeline bubbles, memory bandwidth utilization, collective efficiency.
Knowledge of how the full software stack impacts performance: driver overhead, collective algorithm selection, memory allocation, scheduling, firmware power management.
Experience with statistical methods for performance analysis: regression detection, confidence intervals, A/B comparison at scale.
Strong data analysis and visualization skills (Python, pandas, dashboards).
Ability to communicate performance findings to both deep technical audiences and executive leadership.
Demonstrated success influencing multiple engineering teams to prioritize performance improvements.

Ways to stand out from the crowd

Experience profiling and optimizing distributed training at 1000+ GPU scale (Megatron-LM, DeepSpeed, FSDP).
Background in ML infrastructure performance at a CSP/hyperscaler.
Familiarity with NVIDIA platforms (DGX, HGX, NVLink topology) and profiling tools.
Experience building automated performance regression detection systems for production environments.
Understanding of inference workload performance dynamics (vLLM, TensorRT-LLM, SGLang, continuous batching).

Compensation & Benefits

Base salary range: 272,000 USD - 431,250 USD.
Eligible for equity and benefits (the original posting links to NVIDIA's benefits page).

Additional information

Applications for this job will be accepted at least until June 30, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer and committed to fostering an inclusive work environment.