Required Skills & Competences
Security @ 7, Go @ 6, Linux @ 6, IaC @ 4, Terraform @ 4, Python @ 6, Distributed Systems @ 4, Leadership @ 4, Mathematics @ 4, Microservices @ 4, Data Analysis @ 4, Reporting @ 4
Details
NVIDIA is reinventing computer graphics, PC gaming, and accelerated computing, and is focused on defining the next era of computing driven by AI. The team is seeking a Principal Staff Site Reliability Engineer to drive efficiency and optimize the performance of core infrastructure services, both on-premises and in the cloud. The role partners with leadership, senior engineers, and program and product managers to design, scale, and deploy reliable infrastructure services at global scale.
Responsibilities
- Lead initiatives to transform the IT Compute Core Team and its architecture to build new service offerings across on-prem and cloud environments.
- Design, scale, and deploy core infrastructure services, including DNS, NTP/PTP, DHCP, and LDAP, with a focus on automation, monitoring, high availability, capacity planning, and lifecycle management.
- Define and implement metrics to measure service efficiency and drive improvements via software and hardware optimizations (SR-IOV / DPU).
- Use technologies like eBPF and XDP for observability and DDoS mitigation.
- Collect and analyze system data for capacity planning, develop enterprise-wide capacity plans, and coordinate their implementation with management.
- Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, and monitoring.
- Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers to deliver compelling IT products and services that meet customer needs.
Requirements
- Bachelor’s degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience.
- 15+ years of proven experience in compute platform engineering with a focus on automation.
- Experience designing and deploying containerization architectures and distributed systems infrastructure.
- Proven experience evaluating application architectures and identifying opportunities for containerization to improve scalability, reliability, and efficiency.
- Strong analytical skills with the ability to define and track key performance metrics.
- Experience developing tools for data analysis and performance profiling; development experience with Terraform and configuration management tools.
- Proficiency in programming languages such as Go and/or Python.
- Linux OS proficiency with knowledge of kernel internals.
- Experience running large environments including bare-metal build infrastructure.
- Understanding of network protocols and architectures, including VLAN, VXLAN, SDN, BGP, and Anycast.
Ways To Stand Out
- Deep understanding of infrastructure components such as DNS, LDAP, and security tools.
- Hands-on experience with containers and their implementation.
- Experience deploying and managing services like DNS and LDAP at scale.
- Solid understanding of microservices architecture, infrastructure as code (IaC), and configuration management tools.
Compensation & Other Details
- Base salary range: 248,000 USD - 391,000 USD (final base determined by location, experience, and internal pay parity).
- Eligible for equity and company benefits.
- Location: Santa Clara, CA, United States; position listed as hybrid (#LI-Hybrid).
- Applications accepted at least until August 24, 2025.
- NVIDIA is an equal opportunity employer committed to diversity and inclusion.