Used Tools & Technologies
Machine Learning, LLM
Required Skills & Competences
Each tag name is followed by the "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Go @ 6
Python @ 6
GCP @ 3
TensorFlow @ 3
AWS @ 3
Azure @ 3
Bash @ 6
Communication @ 4
Debugging @ 4
PyTorch @ 3
CUDA @ 4
Cloud Computing @ 3
GPU @ 4
Deep Learning @ 3
AI @ 4
InfiniBand @ 3
Robotics @ 4
NCCL @ 4
HPC @ 4
Details
We are seeking a Senior AI/ML Performance and Efficiency Engineer, GPU Clusters at NVIDIA to join our AI Efficiency efforts. As an engineer you will play a pivotal role in enhancing efficiency for researchers by implementing improvements across the entire stack, collaborating with customers to identify and address infrastructure and application inefficiencies to enable scalable AI/ML research on GPU clusters.
Responsibilities
- Collaborate closely with AI/ML researchers to make ML models more efficient, delivering productivity improvements and cost savings.
- Build tools and frameworks, and apply ML techniques, to detect and analyze efficiency bottlenecks and deliver productivity improvements for researchers.
- Work with researchers on a variety of ML workloads across robotics, autonomous vehicles, large language models (LLMs), video, and more.
- Collaborate across engineering organizations to deliver efficiency in hardware, software, and infrastructure usage.
- Proactively monitor fleet-wide utilization patterns, analyze existing inefficiency patterns or discover new ones, and deliver scalable solutions.
- Keep up to date with recent developments in AI/ML technologies, frameworks, and successful strategies and advocate for their integration.
Requirements
- BS in Computer Science or a related area, or equivalent experience.
- 5+ years of experience designing and operating large-scale compute infrastructure.
- Strong understanding of modern ML techniques and tools.
- Experience investigating and resolving training and inference performance end-to-end.
- Debugging and optimization experience with Nsight Systems and Nsight Compute.
- Experience debugging large-scale distributed training using NCCL.
- Proficiency in programming and scripting languages such as Python, Go, and Bash.
- Familiarity with cloud computing platforms (e.g., AWS, GCP, Azure).
- Experience with parallel computing frameworks and paradigms.
- Dedication to ongoing learning and staying updated on AI/ML infrastructure technologies.
- Excellent communication and collaboration skills.
Ways to stand out / Preferred
- Background with NVIDIA GPUs and CUDA programming.
- Experience with NCCL and MLPerf benchmarking.
- Familiarity with InfiniBand (IBOP) and RDMA.
- Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.
- Familiarity with deep learning frameworks such as PyTorch and TensorFlow.
Compensation & Benefits
- Base salary ranges (location, level, and experience dependent):
  - Level 3: 152,000 USD - 241,500 USD
  - Level 4: 184,000 USD - 287,500 USD
- Eligible for equity and a comprehensive benefits package. Link to benefits: https://www.nvidia.com/en-us/benefits/
Other
- Applications accepted at least until March 23, 2026.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.