Senior Solutions Architect, Cloud Infrastructure and DevOps - NVIS

at NVIDIA

📍 Taipei, Taiwan

SENIOR

✅ Hybrid

Required Skills & Competences

Security @ 8, Ansible @ 4, CentOS @ 8, Chef @ 4, Jenkins @ 4, Kubernetes @ 4, Linux @ 4, DevOps @ 4, Python @ 6, R @ 4, CI/CD @ 4, Bash @ 6, Mathematics @ 4, Networking @ 4, Puppet @ 4, CUDA @ 4, GPU @ 4

Details

NVIDIA is looking for a Senior Cloud Infrastructure/DevOps Solutions Architect to join its NVIDIA Infrastructure Specialist Team. Academic and commercial groups around the world use NVIDIA products to revolutionize deep learning and data analytics and to power their data centers. Join the team building many of the largest and fastest AI/HPC systems in the world! This role sits on a dynamic, customer-focused team and requires excellent interpersonal skills. You will work with customers, partners, and internal teams to analyze, define, and implement large-scale networking projects. The scope includes networking, system design, automation, and serving as the primary technical contact for the customer.

Responsibilities

Maintain large-scale HPC/AI clusters with monitoring, logging, and alerting.
Manage Linux job/workload schedulers and orchestration tools.
Develop and maintain continuous integration and delivery pipelines.
Develop tooling to automate deployment and management of large-scale infrastructure environments, automate operational monitoring and alerting, and enable self-service consumption of resources.
Deploy monitoring solutions for servers, network, and storage.
Perform troubleshooting from bare metal through OS, software stack, and application levels.
Act as a technical resource: develop, redefine, and document standard methodologies to share with internal teams.
Support R&D activities and engage in POCs/POVs for future improvements.

Requirements

BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields.
At least 8 years of professional experience in networking fundamentals, TCP/IP stack, and data center architecture.
Knowledge of HPC and AI solutions, including CPUs, GPUs, high-speed interconnects, and supporting software.
Extensive knowledge and hands-on experience with Kubernetes including container orchestration for AI/ML workloads, resource scheduling, scaling, and HPC integration.
Experience managing and installing HPC clusters, including deployment, optimization, and troubleshooting.
Excellent knowledge of Linux systems (Red Hat/CentOS, Ubuntu), including internals, ACLs, OS-level security, and protocols such as TCP, DHCP, and DNS.
Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS. Familiarity with emerging storage technologies is a plus.
Proficiency in Python programming and Bash scripting.
Experience with automation/configuration management tools, including Jenkins, Ansible, and Puppet/Chef.

Ways to Stand Out

Knowledge of CI/CD pipelines for software deployment and automation.
Knowledge of Kubernetes and container-related microservice technologies.
Experience with GPU-focused hardware/software (DGX, CUDA).
Background with RDMA fabrics (InfiniBand or RoCE).

Benefits

NVIDIA offers competitive salaries and a generous benefits package. NVIDIA values diversity and is an equal opportunity employer. The work environment is creative, autonomous, and challenging, with some of the most brilliant talent in the world.