Required Skills & Competences
Security @ 8 Ansible @ 4 CentOS @ 8 Chef @ 4 Jenkins @ 4 Kubernetes @ 4 Linux @ 4 DevOps @ 4 Python @ 6 R @ 4 CI/CD @ 4 Bash @ 6 Mathematics @ 4 Networking @ 4 Puppet @ 4 CUDA @ 4 GPU @ 4
Details
NVIDIA is looking for a Senior Cloud Infrastructure/DevOps Solutions Architect to join its NVIDIA Infrastructure Specialist Team. Academic and commercial groups around the world use NVIDIA products to revolutionize deep learning and data analytics and to power data centers. Join the team building many of the largest and fastest AI/HPC systems in the world! This role involves working on a dynamic, customer-focused team and requires excellent interpersonal skills. The position involves interaction with customers, partners, and internal teams to analyze, define, and implement large-scale networking projects. The scope includes networking, system design, and automation, as well as serving as the customer-facing point of contact.
Responsibilities
- Maintain large-scale HPC/AI clusters with monitoring, logging, and alerting. Manage Linux job/workload schedulers and orchestration tools.
- Develop and maintain continuous integration and delivery pipelines.
- Develop tooling to automate deployment and management of large-scale infrastructure environments, automate operational monitoring and alerting, and enable self-service consumption of resources.
- Deploy monitoring solutions for servers, network, and storage.
- Perform troubleshooting from bare metal through OS, software stack, and application levels.
- Act as a technical resource: develop, redefine, and document standard methodologies to share with internal teams.
- Support R&D activities and engage in POCs/POVs for future improvements.
Requirements
- BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields.
- At least 8 years of professional experience with networking fundamentals, the TCP/IP stack, and data center architecture.
- Knowledge of HPC and AI solutions, including CPUs, GPUs, high-speed interconnects, and supporting software.
- Extensive knowledge of and hands-on experience with Kubernetes, including container orchestration for AI/ML workloads, resource scheduling, scaling, and HPC integration.
- Experience managing and installing HPC clusters, including deployment, optimization, and troubleshooting.
- Excellent knowledge of Linux systems (Red Hat/CentOS, Ubuntu), including internals, ACLs, OS-level security, and protocols such as TCP, DHCP, and DNS.
- Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS. Familiarity with emerging storage technologies is a plus.
- Proficiency in Python programming and Bash scripting.
- Experience with automation/configuration management tools including Jenkins, Ansible, and Puppet/Chef.
Ways to Stand Out
- Knowledge of CI/CD pipelines for software deployment and automation.
- Knowledge of Kubernetes and container-related microservice technologies.
- Experience with GPU-focused hardware/software (DGX, CUDA).
- Background with RDMA fabrics (InfiniBand or RoCE).
Benefits
Competitive salaries and a generous benefits package. NVIDIA values diversity and is an equal opportunity employer. The work environment is creative, autonomous, and challenging, and you will work alongside some of the most brilliant talent in the world.