Used Tools & Technologies
Not specified
Required Skills & Competences
Security @ 7, Go @ 6, Linux @ 6, IaC @ 4, Terraform @ 4, Python @ 6, Distributed Systems @ 4, Leadership @ 4, Mathematics @ 4, Microservices @ 4, Data Analysis @ 4, Reporting @ 4
Details
NVIDIA is seeking a Principal Staff Site Reliability Engineer to join a team focused on driving efficiency and optimizing infrastructure performance both on-premises and in the cloud. The role participates in designing and scaling core infrastructure services, building automation and observability, and collaborating with leadership and cross-functional teams to deliver reliable IT products and services.
Responsibilities
- Lead initiatives to transform the IT Compute Core Team and its architecture to build new service offerings across on-premises and cloud environments.
- Design, scale, and deploy core infrastructure services including DNS, NTP/PTP, DHCP, and LDAP with a focus on performance and reliability at global scale.
- Build automation, monitoring, high availability, capacity planning, and lifecycle management for core services.
- Define and implement metrics to measure service efficiency, and improve it through software and hardware optimizations (SR-IOV / DPU).
- Use technologies like eBPF and XDP for observability and DDoS mitigation.
- Collect and analyze system data for capacity planning, develop enterprise-wide capacity plans, and coordinate with management to implement changes.
- Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, and monitoring.
- Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers to develop IT products and services that meet customer needs.
Requirements
- Bachelor’s degree in Engineering, Computer Science, Mathematics, or a related field, or equivalent experience.
- 15+ years of proven experience in compute platform engineering with a focus on automation.
- Experience designing and deploying containerization architectures and distributed systems infrastructure.
- Proven experience evaluating existing application architectures and identifying containerization opportunities to improve scalability, reliability, and efficiency.
- Strong analytical skills with the ability to define and track key performance metrics.
- Experience developing tools for data analysis and performance profiling, and working with Terraform and configuration management tools.
- Proficiency in programming languages such as Go and/or Python.
- Linux OS proficiency, including kernel internals.
- Experience running large environments consisting of bare-metal build infrastructure.
- Understanding of network protocols and architectures (VLAN / VXLAN / SDN / BGP / Anycast).
Preferred / Ways To Stand Out
- Deep understanding of other infrastructure components such as DNS, LDAP, and security tools.
- Hands-on experience with containers and their implementation.
- Experience deploying and managing services like DNS and LDAP at scale.
- Solid understanding of microservices architecture, infrastructure as code (IaC), and configuration management tools.
Compensation & Additional Information
- Base salary range: 248,000 USD - 391,000 USD (determined based on location, experience, and internal pay equity).
- Eligible for equity and benefits (see NVIDIA benefits information on the NVIDIA website).
- #LI-Hybrid
- Applications accepted at least until August 24, 2025.
NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment. The company does not discriminate on the basis of characteristics protected by law.