Used Tools & Technologies
Not specified
Required Skills & Competences
Docker @ 4, Kubernetes @ 4, Linux @ 4, DevOps @ 4, Machine Learning @ 4, MLOps @ 4, TensorFlow @ 4, Hiring @ 4, Communication @ 7, Mathematics @ 4, Networking @ 4, IaaS @ 4, Debugging @ 4, API @ 4, PyTorch @ 4, CUDA @ 3, GPU @ 4
Details
NVIDIA is seeking a Field Escalation Solution Architect experienced in validation and debugging of large-scale GPU clusters with a focus on performance. As part of the Solution Architecture organization you will work with sophisticated computing hardware and software, helping architect and scale high-performance, distributed AI infrastructure (on-prem or cloud) built with NVIDIA GPU systems. The role involves validating and debugging customer cluster performance issues, identifying functional bottlenecks, and driving technical engagements around NVIDIA products and technologies.
Responsibilities
- Validate and debug customer cluster performance issues and functional bottlenecks from bare metal through OS, software stack, and application level.
- Architect and scale high-performance distributed AI infrastructure on-premises or in the cloud using NVIDIA GPU supercomputers.
- Deliver demos, assist with proofs-of-concept (POCs), and create technical content (papers, developer blogs) to share knowledge across teams.
- Work directly with developers and hardware architects to debug cluster performance issues, identify requirements, and improve workflows.
- Support account teams by providing in-depth analysis during customer escalations and technical engagements.
- Build custom product demonstrations and POCs addressing customers' critical business needs.
Requirements
- BS, MS, or PhD in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related engineering field, or equivalent experience.
- 8+ years of work-related experience with NVIDIA and/or accelerated computing technologies.
- Platform-level understanding of server architecture, PCIe topology, CPUs, GPUs, NICs, Linux OS, and kernel drivers.
- Experience with networking technologies, including Ethernet and InfiniBand, and with related RDMA technologies such as RoCE.
- Experience working with DevOps in on-prem or cloud environments, including Docker/containers, cloud APIs, IaaS and data center deployments.
- Experience with job schedulers such as SLURM, and container orchestration such as Kubernetes (deployment and debugging skills).
- Deep understanding of dense data center design: compute, storage, networking, cloud APIs, and IaaS.
- Strong analytical, problem-solving, and communication skills; ability to collaborate across engineering, sales, product, and program management.
Ways to stand out
- Demonstrated CPU performance debugging experience.
- Excellent customer-facing skills and background.
- Platform design engineering, coding, and strong debugging skills, including experience with C/C++, the Linux kernel, virtualization, and drivers.
- Experience with profilers/performance analysis tools and telemetry.
- Familiarity with Grace/ARM CPU architecture, NVIDIA systems/SDKs (e.g., CUDA) and NVIDIA networking technologies (e.g., RoCE, InfiniBand), and switch interconnects.
- Understanding of deep learning and machine learning frameworks (TensorFlow or PyTorch), LLMs, MLOps, DevOps, and related workflows.
Logistics, compensation & other details
- Occasional travel required (~20%) for customer on-site visits and conferences.
- Base salary ranges by level:
  - Level 4: 184,000 USD - 287,500 USD
  - Level 5: 224,000 USD - 356,500 USD
- Eligible for equity and benefits as described by NVIDIA.
- Applications accepted at least until October 4, 2025.
Equal opportunity
NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. The company does not discriminate in hiring or promotion on the basis of protected characteristics.