Used Tools & Technologies
Not specified
Required Skills & Competences
Grafana (2), Kubernetes (3), Prometheus (2), Machine Learning (3), AWS (5), Azure (5), Communication (3), Mathematics (3), Networking (3), Performance Optimization (3), OpenTelemetry (2), GPU (3)
Details
We are seeking a Cloud Solution Architect to join NVIDIA's cloud solutions team, focusing on large-scale GPU infrastructure and NVIDIA AI Factory deployments. The role centers on architecting and deploying resilient, telemetry-driven AI compute environments, collaborating with engineering teams to secure design wins, and developing tooling for observability, failure recovery, and infrastructure-level performance optimization.
Responsibilities
- Serve as the technical expert for NVIDIA AI Factory solutions and large-scale GPU infrastructure.
- Architect and deploy resilient, telemetry-driven AI compute environments at scale.
- Collaborate directly with engineering teams to secure design wins and address deployment challenges.
- Develop robust tooling for observability, failure recovery, and infrastructure-level performance optimization.
- Act as a trusted advisor to clients: assess cloud environments, translate requirements into technical solutions, and provide guidance for scalable, reliable, high-performance GPU workloads.
Requirements
- 2+ years of experience in large-scale cloud infrastructure engineering, distributed AI/ML systems, or GPU cluster deployment and management.
- BS in Computer Science, Electrical Engineering, Mathematics, Physics, or equivalent experience.
- Proven understanding of large-scale computing systems architecture, including multi-node GPU clusters, high-performance networking, and distributed storage.
- Experience with infrastructure-as-code, automation, and configuration management for large-scale deployments.
- Passion for machine learning and AI, with a drive to continually learn and apply new technologies.
- Excellent interpersonal skills and the ability to explain complex technical topics to non-experts.
Preferred / Ways to stand out
- Expertise with orchestration and workload management tools such as Slurm, Kubernetes, Run:ai, or similar platforms for GPU resource scheduling.
- Knowledge of AI training and inference performance optimization at scale, including distributed training frameworks and multi-node communication patterns.
- Hands-on experience designing telemetry systems and failure recovery mechanisms; familiarity with observability tools such as Grafana, Prometheus, and OpenTelemetry.
- Proficiency in deploying and managing cloud-native GPU-accelerated solutions on AWS, Azure, or Google Cloud.
- Deep expertise with high-performance networking technologies, particularly NVIDIA InfiniBand, NCCL, and GPU-Direct RDMA.
Compensation & Benefits
- Base salary ranges (location- and level-dependent):
  - Level 2: 120,000 USD - 189,750 USD per year
  - Level 3: 148,000 USD - 235,750 USD per year
- Eligible for equity and NVIDIA benefits.
Other details
- Location: Redmond, WA, United States.
- Employment type: Full time.
- Applications accepted at least until October 11, 2025.
- NVIDIA is an equal opportunity employer committed to a diverse work environment.