Used Tools & Technologies
Not specified
Required Skills & Competences
Grafana (2), Kubernetes (3), Prometheus (2), Machine Learning (3), AWS (5), Azure (5), Communication (3), Mathematics (3), Networking (3), Performance Optimization (3), OpenTelemetry (2), GPU (3)
Details
We are seeking a Cloud Solution Architect to join NVIDIA's cloud solutions team, focusing on large-scale GPU infrastructure and NVIDIA AI Factory deployments. The role centers on architecting and deploying resilient, telemetry-driven AI compute environments, collaborating with engineering teams to secure design wins, and developing tooling for observability, failure recovery, and infrastructure-level performance optimization.
Responsibilities
- Serve as the technical expert for NVIDIA AI Factory solutions and large-scale GPU infrastructure.
- Architect and deploy resilient, telemetry-driven AI compute environments at scale.
- Collaborate directly with engineering teams to secure design wins and address deployment challenges.
- Develop robust tooling for observability, failure recovery, and infrastructure-level performance optimization.
- Act as a trusted advisor to clients: assess cloud environments, translate requirements into technical solutions, and provide guidance for scalable, reliable, high-performance GPU workloads.
Requirements
- 2+ years of experience in large-scale cloud infrastructure engineering, distributed AI/ML systems, or GPU cluster deployment and management.
- BS in Computer Science, Electrical Engineering, Mathematics, Physics, or equivalent experience.
- Proven understanding of large-scale computing systems architecture, including multi-node GPU clusters, high-performance networking, and distributed storage.
- Experience with infrastructure-as-code, automation, and configuration management for large-scale deployments.
- Passion for machine learning and AI, with a drive to continually learn and apply new technologies.
- Excellent interpersonal skills and the ability to explain complex technical topics to non-experts.
Preferred / Ways to stand out
- Expertise with orchestration and workload management tools such as Slurm, Kubernetes, Run:ai, or similar platforms for GPU resource scheduling.
- Knowledge of AI training and inference performance optimization at scale, including distributed training frameworks and multi-node communication patterns.
- Hands-on experience designing telemetry systems and failure recovery mechanisms; familiarity with observability tools such as Grafana, Prometheus, and OpenTelemetry.
- Proficiency in deploying and managing cloud-native GPU-accelerated solutions on AWS, Azure, or Google Cloud.
- Deep expertise with high-performance networking technologies, particularly NVIDIA InfiniBand, NCCL, and GPU-Direct RDMA.
Compensation & Benefits
- Base salary ranges (location- and level-dependent):
  - Level 2: 120,000 USD - 189,750 USD per year
  - Level 3: 148,000 USD - 235,750 USD per year
- Eligible for equity and NVIDIA benefits.
Other details
- Location: Redmond, WA, United States.
- Employment type: Full time.
- Applications accepted at least until October 11, 2025.
- NVIDIA is an equal opportunity employer committed to a diverse work environment.