Principal Staff Site Reliability Engineer

at NVIDIA

📍 Santa Clara, United States

USD 248,000-391,000 per year

SENIOR

✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 7, Go @ 6, Linux @ 6, IaC @ 4, Terraform @ 4, Python @ 6, Distributed Systems @ 4, Leadership @ 4, Mathematics @ 4, Microservices @ 4, Data Analysis @ 4, Reporting @ 4

Details

NVIDIA is seeking a Principal Staff Site Reliability Engineer to join a team focused on driving efficiency and optimizing infrastructure performance both on-premises and in the cloud. The role involves designing and scaling core infrastructure services, building automation and observability, and collaborating with leadership and cross-functional teams to deliver reliable IT products and services.

Responsibilities

Lead initiatives to transform the IT Compute Core Team and architecture to build new service offerings across On-Prem and Cloud.
Design, scale, and deploy core infrastructure services including DNS, NTP/PTP, DHCP, and LDAP with a focus on performance and reliability at global scale.
Build automation, monitoring, high availability, capacity planning, and lifecycle management for core services.
Define and implement metrics to measure service efficiency, and improve that efficiency through software and hardware optimizations (SR-IOV / DPU).
Use technologies like eBPF and XDP for observability and DDoS mitigation.
Collect, review, and analyze system data for capacity planning, develop enterprise-wide capacity plans, and coordinate with management to implement changes.
Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, and monitoring.
Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers to develop IT products and services that meet customer needs.

Requirements

Bachelor’s degree in Engineering, Computer Science, Mathematics, or a related field, or equivalent experience.
15+ years of proven experience in compute platform engineering with a focus on automation.
Experience designing and deploying containerization architectures and distributed systems infrastructure.
Proven experience evaluating existing application architectures and identifying containerization opportunities to improve scalability, reliability, and efficiency.
Strong analytical skills with the ability to define and track key performance metrics.
Experience developing tools for data analysis and performance profiling; development with Terraform and configuration management tools.
Proficiency in programming languages such as Go and/or Python.
Linux OS proficiency, including kernel internals.
Experience running large environments consisting of bare-metal build infrastructure.
Understanding of network protocols and architectures (VLAN / VxLAN / SDN / BGP / Anycast).

Preferred / Ways To Stand Out

Deep understanding of other infrastructure components such as DNS, LDAP, and security tools.
Hands-on experience with containers and their implementation.
Experience deploying and managing services like DNS and LDAP at scale.
Solid understanding of microservices architecture, infrastructure as code (IaC), and configuration management tools.

Compensation & Additional Information

Base salary range: USD 248,000-391,000 per year (determined by location, experience, and internal pay equity).
Eligible for equity and benefits (see NVIDIA benefits information on the NVIDIA website).
Applications accepted at least until August 24, 2025.

NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment. The company does not discriminate on the basis of characteristics protected by law.