Senior Architect, Internal Cloud Infrastructure

at Nvidia
$184,000-356,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Software Development @ 8, Ceph @ 4, Chef @ 4, Docker @ 8, ElasticSearch @ 4, Kafka @ 4, Kubernetes @ 4, Linux @ 4, MySQL @ 4, Python @ 7, SQL @ 4, Java @ 7, NoSQL @ 4, Distributed Systems @ 4, Git @ 4, MongoDB @ 4, OpenStack @ 4, Android @ 4, API @ 7, Hadoop @ 4, Puppet @ 4, Cassandra @ 4

Details

NVIDIA is seeking an AI Solutions Architect to join its Infrastructure Planning and Process Team! This role will focus on the extensive scale-up of key AI solutions for NVIDIA's internal cloud infrastructure. IPP (Infrastructure, Planning and Process) is a global organization within NVIDIA, working closely with various teams such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars to meet their infrastructure needs. The cloud services support nearly half a million automated jobs daily on five thousand servers, enhancing the productivity of thousands of NVIDIA software developers worldwide. The cloud hosts a diverse mix of machines and devices with various operating systems (Windows/Linux/Android) and hardware platforms, including NVIDIA GPUs and Tegra processors.

Responsibilities

  • Serve as an Architect developing internal AI systems used by thousands of NVIDIANs globally.
  • Identify gaps and issues, and determine which are better addressed by AI solutions versus conventional approaches.
  • For the AI candidates, evaluate buy-versus-build options by researching the tools available in the market.
  • Align with teams across NVIDIA to establish overall AI system goals and break them down into specific objectives for each sub-system.
  • Drive, motivate, convince, and mentor sub-system leads to achieve improvements with agility and speed.
  • Identify performance bottlenecks and optimize the speed and cost efficiency of AI development and testing systems.
  • Drive the planning of software/hardware capacity, covering both internal and public cloud, addressing the balance between time and utilization.
  • Introduce technologies enabling massively parallel systems to improve turnaround time by an order of magnitude.

Requirements

  • BS in EE/CS or equivalent experience, with 10+ years of systems software development and at least 1 year of experience developing or exploring AI.
  • Experience developing with Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), LLM fine-tuning, agentic AI workflows, LangChain, LangGraph, and cascading models.
  • Experience deploying in hybrid and multi-cloud architectures as well as edge computing.
  • Extensive experience architecting and shipping large-scale distributed software systems.
  • Ability to identify gaps and bottlenecks, and develop solutions to optimize performance.
  • Strong programming and software development skills in Java, Python, and shell scripting, along with a good understanding of distributed systems and REST APIs.
  • Experience working with SQL/NoSQL database systems such as MySQL, Cassandra, MongoDB, or Elasticsearch.
  • Excellent knowledge of and working experience with Docker containers and virtual machines.
  • Good background in cloud technologies such as OpenStack, Docker, Kubernetes, Chef/Puppet, Hadoop/Ceph/SwiftStack, LXC, Git, Perforce, JFrog, and Kafka.
  • Ability to work effectively across organizational boundaries to improve alignment and productivity between teams in a multinational, multi-time-zone corporate environment.

Benefits

With competitive salaries and a generous benefits package, we are widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us and, due to unprecedented growth, our best-in-class engineering teams are rapidly growing. If you're a creative and autonomous engineer with a real passion for technology, we want to hear from you.