Senior DGX Cloud Performance Engineer

at NVIDIA
USD 224,000-425,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Python @ 6, GCP @ 4, TensorFlow @ 3, AWS @ 4, Azure @ 4, Networking @ 4, LLM @ 4, PyTorch @ 3, CUDA @ 6

Details

NVIDIA DGX™ Cloud is an end-to-end, scalable AI platform for developers, offering scalable capacity built on the latest NVIDIA architecture and co-engineered with the world’s leading cloud service providers (CSPs). This role focuses on performance analysis, optimization, and modeling to define the architecture and design of NVIDIA's DGX Cloud clusters.

Responsibilities

  • Develop benchmarks and end-to-end customer applications that run at scale, instrumented with performance measurement, tracking, and sampling so that performance can be analyzed and optimized.
  • Construct experiments to analyze and identify performance bottlenecks and dependencies from an end-to-end perspective.
  • Propose improvements to system performance and usability by leading hardware and software changes.
  • Collaborate with external CSPs during cluster deployment and workload optimization.
  • Work with AI researchers, developers, and application providers to understand requirements and share best practices.
  • Engage with a diverse set of LLM workloads in areas such as healthcare, climate modeling, pharmaceuticals, financial futures, genomics, and drug discovery.
  • Develop modeling frameworks and TCO analysis for efficient exploration of architecture and design space.
  • Drive engineering analysis to inform DGX Cloud architecture, design, and roadmap.

Requirements

  • 12+ years of proven experience.
  • Expertise in working with large-scale parallel and distributed accelerator-based systems.
  • Strong background in optimizing the performance of AI workloads on large-scale systems.
  • Experience with performance modeling and benchmarking at scale.
  • Knowledge of computer architecture, networking, storage systems, and accelerators.
  • Familiarity with AI frameworks such as PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, and vLLM.
  • Experience with AI/ML models, especially LLMs.
  • Understanding of DNNs and their use in AI/ML applications.
  • Bachelor’s or Master’s degree in Engineering (Electrical Engineering, Computer Engineering, or Computer Science) or equivalent experience.
  • Proficiency in Python and C/C++.
  • Expertise with at least one public CSP infrastructure (GCP, AWS, Azure, OCI, etc.).

Ways to Stand Out

  • High intellectual curiosity and confidence to engage with complex problems.
  • Proficiency in CUDA and XLA.
  • Excellent interpersonal skills.
  • PhD is a plus.

Competitive salaries and a generous benefits package are offered. NVIDIA values diversity and is an equal opportunity employer.