Solutions Architect - Cloud Infrastructure

at Nvidia
USD 120,000-235,750 per year
Middle / Senior
✅ On-site


Used Tools & Technologies

Not specified

Required Skills & Competences

  • Grafana @ 2
  • Kubernetes @ 3
  • Prometheus @ 2
  • Machine Learning @ 3
  • AWS @ 5
  • Azure @ 5
  • Communication @ 3
  • Mathematics @ 3
  • Networking @ 3
  • Performance Optimization @ 3
  • OpenTelemetry @ 2
  • GPU @ 3

Details

We are seeking a Cloud Solution Architect to join NVIDIA's cloud solutions team, focusing on large-scale GPU infrastructure and NVIDIA AI Factory deployments. The role centers on architecting and deploying resilient, telemetry-driven AI compute environments, collaborating with engineering teams to secure design wins, and developing tooling for observability, failure recovery, and infrastructure-level performance optimization.

Responsibilities

  • Serve as the technical expert for NVIDIA AI Factory solutions and large-scale GPU infrastructure.
  • Architect and deploy resilient, telemetry-driven AI compute environments at scale.
  • Collaborate directly with engineering teams to secure design wins and address deployment challenges.
  • Develop robust tooling for observability, failure recovery, and infrastructure-level performance optimization.
  • Act as a trusted advisor to clients: assess cloud environments, translate requirements into technical solutions, and provide guidance for scalable, reliable, high-performance GPU workloads.

Requirements

  • 2+ years of experience in large-scale cloud infrastructure engineering, distributed AI/ML systems, or GPU cluster deployment and management.
  • BS in Computer Science, Electrical Engineering, Mathematics, Physics, or equivalent experience.
  • Proven understanding of large-scale computing systems architecture, including multi-node GPU clusters, high-performance networking, and distributed storage.
  • Experience with infrastructure-as-code, automation, and configuration management for large-scale deployments.
  • Passion for machine learning and AI, with a drive to continually learn and apply new technologies.
  • Excellent interpersonal skills and the ability to explain complex technical topics to non-experts.

Preferred / Ways to stand out

  • Expertise with orchestration and workload management tools such as Slurm, Kubernetes, Run:ai, or similar platforms for GPU resource scheduling.
  • Knowledge of AI training and inference performance optimization at scale, including distributed training frameworks and multi-node communication patterns.
  • Hands-on experience designing telemetry systems and failure recovery mechanisms; familiarity with observability tools such as Grafana, Prometheus, and OpenTelemetry.
  • Proficiency in deploying and managing cloud-native GPU-accelerated solutions on AWS, Azure, or Google Cloud.
  • Deep expertise with high-performance networking technologies, particularly NVIDIA InfiniBand, NCCL, and GPUDirect RDMA.

Compensation & Benefits

  • Base salary ranges (location- and level-dependent):
    • Level 2: 120,000 USD - 189,750 USD per year
    • Level 3: 148,000 USD - 235,750 USD per year
  • Eligible for equity and NVIDIA benefits.

Other details

  • Location: Redmond, WA, United States.
  • Employment type: Full time.
  • Applications accepted at least until October 11, 2025.
  • NVIDIA is an equal opportunity employer committed to a diverse work environment.