Principal Staff Site Reliability Engineer

at NVIDIA
USD 248,000-391,000 per year
SENIOR
✅ Hybrid

Required Skills & Competences

Security @ 7, Go @ 6, Linux @ 6, IaC @ 4, Terraform @ 4, Python @ 6, Distributed Systems @ 4, Leadership @ 4, Mathematics @ 4, Microservices @ 4, Data Analysis @ 4, Reporting @ 4

Details

NVIDIA is seeking a Principal Staff Site Reliability Engineer to join a team focused on driving efficiency and optimizing infrastructure performance both on-premises and in the cloud. The role participates in designing and scaling core infrastructure services, building automation and observability, and collaborating with leadership and cross-functional teams to deliver reliable IT products and services.

Responsibilities

  • Lead initiatives to transform the IT Compute Core Team and architecture to build new service offerings across On-Prem and Cloud.
  • Design, scale, and deploy core infrastructure services including DNS, NTP/PTP, DHCP, and LDAP with a focus on performance and reliability at global scale.
  • Build automation, monitoring, high availability, capacity planning, and lifecycle management for core services.
  • Define and implement metrics to measure service efficiency, and improve it through software and hardware optimizations (SR-IOV / DPU).
  • Use technologies like eBPF and XDP for observability and DDoS mitigation.
  • Collect and analyze system data for capacity planning, develop enterprise-wide capacity plans, and coordinate with management to implement changes.
  • Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, and monitoring.
  • Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers to develop IT products and services that meet customer needs.

Requirements

  • Bachelor’s degree in Engineering, Computer Science, Mathematics, or a related field, or equivalent experience.
  • 15+ years of proven experience in compute platform engineering with a focus on automation.
  • Experience designing and deploying containerization architectures and distributed systems infrastructure.
  • Proven experience evaluating existing application architectures and identifying containerization opportunities to improve scalability, reliability, and efficiency.
  • Strong analytical skills with the ability to define and track key performance metrics.
  • Experience developing tools for data analysis and performance profiling, and developing with Terraform and configuration management tools.
  • Proficiency in programming languages such as Go and/or Python.
  • Linux OS proficiency, including kernel internals.
  • Experience running large environments consisting of bare-metal build infrastructure.
  • Understanding of network protocols and architectures (VLAN / VXLAN / SDN / BGP / Anycast).

Preferred / Ways To Stand Out

  • Deep understanding of other infrastructure components such as DNS, LDAP, and security tools.
  • Hands-on experience with containers and their implementation.
  • Experience deploying and managing services like DNS and LDAP at scale.
  • Solid understanding of microservices architecture, infrastructure as code (IaC), and configuration management tools.

Compensation & Additional Information

  • Base salary range: 248,000 USD - 391,000 USD (determined based on location, experience, and internal pay equity).
  • Eligible for equity and benefits (see NVIDIA benefits information on the NVIDIA website).
  • Applications accepted at least until August 24, 2025.

NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment. The company does not discriminate on the basis of characteristics protected by law.