Principal Staff Site Reliability Engineer

at Nvidia
USD 248,000-391,000 per year
SENIOR
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 7, Go @ 6, Linux @ 6, IaC @ 4, Terraform @ 4, Python @ 6, Distributed Systems @ 4, Leadership @ 4, Mathematics @ 4, Microservices @ 4, Data Analysis @ 4, Reporting @ 4

Details

NVIDIA is reinventing computer graphics, PC gaming, and accelerated computing and is focused on defining the next era of computing driven by AI. The team is seeking a Principal Staff Site Reliability Engineer to drive efficiency and optimize the performance of core infrastructure services both on-premises and in the cloud. The role partners with leadership, senior engineers, and program and product managers to design, scale, and deploy reliable infrastructure services at global scale.

Responsibilities

  • Lead initiatives to transform the IT Compute Core Team and its architecture to build new service offerings across on-prem and cloud environments.
  • Design, scale, and deploy core infrastructure services including DNS, NTP/PTP, DHCP, and LDAP with focus on automation, monitoring, high availability, capacity planning, and lifecycle management.
  • Define and implement metrics to measure service efficiency and drive improvements via software and hardware optimizations (SR-IOV / DPU).
  • Use technologies like eBPF and XDP for observability and DDoS mitigation.
  • Collect and review system data for capacity planning; analyze capacity data, develop enterprise-wide plans, and coordinate their implementation with management.
  • Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, and monitoring.
  • Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers to deliver compelling IT products and services that meet customer needs.

Requirements

  • Bachelor’s degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience.
  • 15+ years of proven experience in compute platform engineering with a focus on automation.
  • Experience designing and deploying containerization architectures and distributed systems infrastructure.
  • Proven experience evaluating application architectures and identifying opportunities for containerization to improve scalability, reliability, and efficiency.
  • Strong analytical skills with ability to define and track key performance metrics.
  • Experience developing tools for data analysis and performance profiling; development experience with Terraform and configuration management tools.
  • Proficiency in programming languages such as Go and/or Python.
  • Linux OS proficiency with knowledge of kernel internals.
  • Experience running large environments including bare-metal build infrastructure.
  • Understanding of network protocols and architectures, including VLAN, VxLAN, SDN, BGP, and Anycast.

Ways To Stand Out

  • Deep understanding of infrastructure components such as DNS, LDAP, and security tools.
  • Hands-on experience with containers and their implementation.
  • Experience deploying and managing services like DNS and LDAP at scale.
  • Solid understanding of microservices architecture, infrastructure as code (IaC), and configuration management tools.

Compensation & Other Details

  • Base salary range: 248,000 USD - 391,000 USD (final base determined by location, experience, and internal pay parity).
  • Eligible for equity and company benefits.
  • Location: Santa Clara, CA, United States; position listed as hybrid (#LI-Hybrid).
  • Applications accepted at least until August 24, 2025.
  • NVIDIA is an equal opportunity employer committed to diversity and inclusion.