Required Skills & Competences
Marketing @ 7, Docker @ 4, Kubernetes @ 4, Linux @ 4, DevOps @ 4, Machine Learning @ 4, MLOps @ 4, Data Science @ 4, TensorFlow @ 4, Hiring @ 4, Communication @ 7, Mathematics @ 4, Networking @ 4, IaaS @ 4, Debugging @ 4, API @ 4, LLM @ 4, PyTorch @ 4, CUDA @ 3, GPU @ 4
Details
NVIDIA is looking for a Field Escalation Solution Architect with experience in validation and debugging of large-scale GPU clusters focused on performance. As part of the Solution Architecture organization, you will work with cutting-edge computing hardware and software, driving deep learning and machine learning breakthroughs with NVIDIA’s enterprise customers. Primary responsibilities include validating and debugging customer cluster performance issues, identifying functional bottlenecks, and driving customer technical engagements around NVIDIA products and technologies.
Responsibilities
- Stay up to date on High Performance Computing (HPC), Deep Learning, and Machine Learning ecosystems.
- Architect and scale high-performance, distributed AI infrastructure on-prem or in the cloud built with NVIDIA GPU supercomputers for new and existing customers.
- Address and resolve problems from bare metal up through the operating system, software stack, and application levels.
- Deliver demos, assist with proof-of-concepts, and write papers and developer blogs to share knowledge across teams.
- Collaborate with executives and engineering to address sophisticated problems and bring NVIDIA technologies to life in the cloud and datacenter.
- Work directly with developers and hardware architects to debug cluster performance issues, identify requirements, cross-train account solution architects, and improve workflows.
- Support account teams when extra analysis is required for debugging customer issues and provide expertise to make account and product engineering teams more effective.
- Build custom product demonstrations and POCs addressing critical customer business needs.
Requirements
- BS, MS, or PhD in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or other Engineering fields, or equivalent experience.
- 8+ years of work-related experience with NVIDIA and/or accelerated computing technologies.
- Platform-level understanding of server architecture, PCIe topology, CPUs, GPUs, NICs, Linux OS, and kernel drivers.
- Networking experience, including Ethernet, InfiniBand, or other networking protocols.
- Experience working with DevOps on-prem or in cloud environments, including Docker/Containers, cloud APIs, IaaS, and data center deployments.
- SLURM, Kubernetes, and/or other job scheduler deployment and debugging skills.
- Deep understanding of dense data center design, including compute, storage, networking, cloud APIs, and IaaS.
- Strong analytical and problem-solving skills.
- Strong written and verbal communication skills with ability to collaborate across engineering, sales, marketing, product, and program management.
Ways to stand out
- Demonstrated CPU performance debugging experience.
- Excellent customer-facing skills and background.
- Platform design engineering and coding experience, with proficient debugging skills, including experience with C/C++, the Linux kernel, virtualization and drivers, profilers and performance analysis tools (CPU and GPU), and telemetry.
- Hands-on familiarity with Grace/ARM CPU architecture, NVIDIA systems/SDKs (e.g., CUDA), NVIDIA networking technologies (e.g., RoCE, InfiniBand), and switch interconnects.
- Understanding of Deep Learning and Machine Learning frameworks (TensorFlow or PyTorch), LLM, MLOps, DevOps, and workflows applying cloud technologies, Docker/containers, Kubernetes, cloud APIs, and datacenter deployments.
Benefits and additional details
- Occasional travel required (~20%) for on-site customer visits and data science conferences.
- Base salary will be determined based on location, experience, and pay of employees in similar positions.
- Base salary ranges:
  - Level 4: 184,000 USD - 287,500 USD
  - Level 5: 224,000 USD - 356,500 USD
- Eligible for equity and benefits (see NVIDIA benefits page).
- Applications accepted at least until October 4, 2025.
- NVIDIA is an equal opportunity employer and values diversity in hiring and promotion practices.