Principal AI Infrastructure SRE Engineer

at NVIDIA

📍 Santa Clara, United States

USD 248,000-391,000 per year

SENIOR

✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 4, Software Development @ 4, Go @ 6, Kubernetes @ 4, IaC @ 4, Terraform @ 4, Python @ 6, Leadership @ 4, Mathematics @ 4, Microservices @ 4, Data Analysis @ 4, Reporting @ 4

Details

NVIDIA has been reinventing computer graphics, PC gaming, and accelerated computing for 30 years. It is a unique legacy of innovation fueled by great technology and amazing people. Today, NVIDIA is tapping into the unlimited potential of AI to define the next era of computing, where GPUs act as the brains of computers, generative AI, robots, and self-driving cars that understand the world.

Our company is at the forefront of technological innovation and is dedicated to driving efficiency and optimizing the performance of our infrastructure both on-prem and in the cloud.

Responsibilities

Lead initiatives to transform on-premises IT infrastructure platform architecture and services for modern AI workloads and for AI semiconductor and software development.
Collaborate with partners to design, build, and operate platforms that transform storage, compute, and middleware with modern security paradigms.
Build software and automation to run infrastructure at scale with minimal human intervention.
Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, and monitoring.
Collect, review, and analyze system data for capacity planning; develop plans for appropriate enterprise-wide system levels; and coordinate with management to implement changes.
Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers to develop compelling IT products and services that meet customer needs.

Requirements

Bachelor’s degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience.
15+ years of proven experience in compute platform engineering with a focus on automation.
Experience with design, deployment, and operation of infrastructure supporting AI and software development at scale, including Kubernetes and integrating modern AI data infrastructure platforms into Kubernetes workloads.
Proven experience integrating existing application architectures and identifying containerization opportunities for improved scalability, reliability, and efficiency.
Proficiency in programming languages such as Go and/or Python.
Experience developing tools for data analysis and performance profiling, with development experience using Terraform and configuration management tools.
Experience designing and running large environments of bare-metal servers and virtualized infrastructure, consisting of tens of thousands of VMs, as well as cloud or AI infrastructure.
Deep understanding of other infrastructure components such as storage, DNS, Active Directory, and security tools.

Ways to Stand Out

Solid understanding of microservices architecture, infrastructure as code (IaC), and configuration management tools.
Understanding of AI operations and how to leverage large language models (LLMs) to automate various optimization initiatives.

NVIDIA is widely considered one of the technology world’s most desirable employers, with forward-thinking and hardworking people. Creativity and autonomy are valued highly.

Benefits

You will be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.