Senior Solutions Architect, Cloud Infrastructure And DevOps - NVIS

at Nvidia
USD 150,000-200,000 per year
SENIOR
✅ On-site

Required Skills & Competences

Security @ 8, Ansible @ 4, CentOS @ 8, Chef @ 4, Jenkins @ 4, Kubernetes @ 4, Linux @ 4, DevOps @ 4, Python @ 6, CI/CD @ 4, Bash @ 6, Mathematics @ 4, Networking @ 4, Puppet @ 4, CUDA @ 4, GPU @ 4

Details

NVIDIA is looking for a Senior Cloud Infrastructure/DevOps Solutions Architect to join its NVIDIA Infrastructure Specialist Team. Academic and commercial groups around the world are using NVIDIA products to revolutionize deep learning and data analytics, and to power data centers. Join the team building many of the largest and fastest AI/HPC systems in the world! This role requires excellent interpersonal skills and involves working with customers, partners, and internal teams to analyze, define, and implement large-scale networking projects. The scope includes networking, system design, automation, and serving as the primary point of contact for the customer.

Responsibilities

  • Maintain large-scale HPC/AI clusters with monitoring, logging, and alerting.
  • Manage Linux job/workload schedulers and orchestration tools.
  • Develop and maintain continuous integration and delivery pipelines.
  • Develop tooling that automates the deployment and management of large-scale infrastructure environments, automates operational monitoring and alerting, and enables self-service consumption of resources.
  • Deploy monitoring solutions for servers, network, and storage.
  • Troubleshoot across the full stack, from bare metal and operating system through the software stack to the application level.
  • Develop, redefine and document standard methodologies to share with internal teams.
  • Support Research & Development activities and engage in POCs/POVs for future improvements.

Requirements

  • BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields.
  • At least 8 years of professional experience with networking fundamentals, the TCP/IP stack, and data center architecture.
  • Knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.
  • Extensive knowledge and hands-on experience with Kubernetes, including container orchestration for AI/ML workloads, resource scheduling, scaling, and HPC integration.
  • Experience installing and managing HPC clusters, including deployment, optimization, and troubleshooting.
  • Excellent knowledge of Linux systems (Red Hat/CentOS and Ubuntu), including internals, ACLs, OS-level security protections, TCP, DHCP, and DNS.
  • Experience with storage solutions: Lustre, GPFS, ZFS, XFS; familiarity with emerging storage technologies.
  • Proficiency in Python programming and Bash scripting.
  • Comfortable with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef.

Ways to Stand Out

  • Knowledge of CI/CD pipelines for software deployment and automation.
  • Knowledge of Kubernetes and container-related microservice technologies.
  • Experience with GPU-focused hardware/software (DGX, CUDA).
  • Background with RDMA (InfiniBand or RoCE) fabrics.

About NVIDIA

NVIDIA is at the forefront of breakthroughs in Artificial Intelligence, High-Performance Computing, and Visualization. We offer highly competitive salaries, extensive benefits, and a work environment that promotes diversity, inclusion, and flexibility. We are an equal opportunity employer, fostering a supportive and empowering workplace for all.