Used Tools & Technologies
Not specified
Required Skills & Competences
Python @ 6, GCP @ 4, TensorFlow @ 3, AWS @ 4, Azure @ 4, Networking @ 4, LLM @ 4, PyTorch @ 3, CUDA @ 6
Details
NVIDIA DGX™ Cloud is an end-to-end, scalable AI platform for developers, offering scalable capacity built on the latest NVIDIA architecture and co-engineered with the world’s leading cloud service providers (CSPs). This role focuses on performance analysis, optimization, and modeling to define the architecture and design of NVIDIA's DGX Cloud clusters.
Responsibilities
- Develop benchmarks and end-to-end customer applications running at scale, instrumented for performance measurements, tracking, and sampling to measure and optimize performance.
- Construct experiments to analyze and identify performance bottlenecks and dependencies from an end-to-end perspective.
- Propose improvements to system performance and usability by leading hardware and software changes.
- Collaborate with external CSPs during cluster deployment and workload optimization.
- Work with AI researchers, developers, and application providers to understand requirements and share best practices.
- Engage with a diverse set of LLM workloads in areas such as healthcare, climate modeling, pharmaceuticals, financial futures, genomics, and drug discovery.
- Develop modeling frameworks and TCO analysis for efficient exploration of architecture and design space.
- Drive engineering analysis to advise DGX Cloud architecture, design, and roadmap.
Requirements
- 12+ years of proven experience.
- Expertise in working with large-scale parallel and distributed accelerator-based systems.
- Strong background in optimizing performance and AI workloads on large-scale systems.
- Experience with performance modeling and benchmarking at scale.
- Knowledge of computer architecture, networking, storage systems, and accelerators.
- Familiarity with AI frameworks such as PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, and vLLM.
- Experience with AI/ML models, especially LLMs.
- Understanding of DNNs and their use in AI/ML applications.
- Bachelor’s or Master’s degree in Engineering (Electrical, Computer Engineering, Computer Science) or equivalent experience.
- Proficiency in Python and C/C++.
- Expertise with at least one public CSP infrastructure (GCP, AWS, Azure, OCI, etc.).
Ways to Stand Out
- High intellectual curiosity and confidence to engage with complex problems.
- Proficiency in CUDA and XLA.
- Excellent interpersonal skills.
- PhD is a plus.
Competitive salaries and a generous benefits package are offered. NVIDIA values diversity and is an equal opportunity employer.