Used Tools & Technologies
Not specified
Required Skills & Competences
Security @ 7, Go @ 6, Linux @ 6, IaC @ 4, Terraform @ 4, Python @ 6, Distributed Systems @ 4, Leadership @ 4, Mathematics @ 4, Microservices @ 4, Data Analysis @ 4, Reporting @ 4
Details
NVIDIA is seeking a Principal Staff Site Reliability Engineer to join a team focused on driving efficiency and optimizing infrastructure performance both on-premises and in the cloud. The role participates in designing and scaling core infrastructure services, building automation and observability, and collaborating with leadership and cross-functional teams to deliver reliable IT products and services.
Responsibilities
- Lead initiatives to transform the IT Compute Core Team and architecture to build new service offerings across On-Prem and Cloud.
 - Design, scale, and deploy core infrastructure services including DNS, NTP/PTP, DHCP, and LDAP with a focus on performance and reliability at global scale.
 - Build automation, monitoring, high availability, capacity planning, and lifecycle management for core services.
 - Define and implement metrics to measure service efficiency, and improve it through software and hardware optimizations (SR-IOV / DPU).
 - Use technologies like eBPF and XDP for observability and DDoS mitigation.
 - Collect and analyze system data for capacity planning, develop enterprise-wide capacity plans, and coordinate with management to implement changes.
 - Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, and monitoring.
 - Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers to develop IT products and services that meet customer needs.
 
Requirements
- Bachelor’s degree in Engineering, Computer Science, Mathematics, or a related field, or equivalent experience.
 - 15+ years of proven experience in compute platform engineering with a focus on automation.
 - Experience designing and deploying containerization architectures and distributed systems infrastructure.
 - Proven experience evaluating existing application architectures and identifying containerization opportunities to improve scalability, reliability, and efficiency.
 - Strong analytical skills with the ability to define and track key performance metrics.
 - Experience developing tools for data analysis and performance profiling, and developing with Terraform and configuration management tools.
 - Proficiency in programming languages such as Go and/or Python.
 - Linux OS proficiency, including kernel internals.
 - Experience running large environments consisting of bare-metal build infrastructure.
 - Understanding of network protocols and architectures (VLAN / VXLAN / SDN / BGP / Anycast).
 
Preferred / Ways To Stand Out
- Deep understanding of other infrastructure components such as DNS, LDAP, and security tools.
 - Hands-on experience with containers and their implementation.
 - Experience deploying and managing services like DNS and LDAP at scale.
 - Solid understanding of microservices architecture, infrastructure as code (IaC), and configuration management tools.
 
Compensation & Additional Information
- Base salary range: 248,000 USD - 391,000 USD (determined based on location, experience, and internal pay equity).
 - Eligible for equity and benefits (see NVIDIA benefits information on the NVIDIA website).
 - #LI-Hybrid
 - Applications accepted at least until August 24, 2025.
 
NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment. The company does not discriminate on the basis of characteristics protected by law.