Network Site Reliability Engineer

at Nvidia
USD 168,000-264,500 per year
MIDDLE
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

Ansible @ 3, Cumulus Linux @ 3, Go @ 3, Grafana @ 2, Linux @ 3, Prometheus @ 2, Python @ 3, Communication @ 3, SRE @ 3, Jira @ 3, ServiceNow @ 3, Debugging @ 3, Salt @ 3

Details

The Enterprise Network Support and SRE team is looking to add a seasoned technical SRE lead to help realize the SRE vision for our network infrastructure. This role focuses on making network operations seamless, with an emphasis on user experience, automation, observability, documentation, and operational excellence. The Network SRE will enhance network operations, minimize manual labor, achieve Service Level Objectives (SLOs), document runbooks and knowledge base articles (KBs) usable by bots, follow through on root cause analyses (RCAs) and blameless postmortems, and proactively identify and mitigate network risks. The role includes hands-on troubleshooting and mentorship responsibilities.

Responsibilities

  • Own the operational aspects of network infrastructure, ensuring high availability and reliability.
  • Partner with architecture and deployment teams to ensure new implementations are supportable and meet production standards.
  • Advocate for and implement automation to reduce toil and improve operational efficiency.
  • Monitor network performance, identify improvements, and coordinate with teams to execute enhancements.
  • Collaborate with SMEs to resolve production issues promptly and maintain customer satisfaction.
  • Identify operational improvement opportunities and partner with teams to develop sustainable solutions.

Requirements

  • BS degree in Computer Science, Electrical Engineering, or a related technical field, or equivalent experience.
  • Minimum of 8 years of industry experience in network site reliability engineering, network automation, network operations, or related areas, including experience with both campus and data center networks.
  • Familiarity with network management and observability tools such as Prometheus, Grafana, Alertmanager, Nautobot/NetBox, and BigPanda.
  • Expertise in network automation frameworks such as Salt, Ansible, or similar.
  • In-depth experience in one or more programming languages, such as Python or Go.
  • Knowledge of network technologies and protocols: TCP/UDP, IPv4/IPv6, Wireless, BGP, VPN, L2 switching, Firewalls, Load Balancers, EVPN, VXLAN, Segment Routing.
  • Experience with ServiceNow and Jira.
  • Knowledge of Linux system fundamentals is a plus.
  • Systematic problem-solving approach, excellent communication skills, ownership, and drive.

Ways to Stand Out

  • Experience using operational signals (SNMP, Syslog, Streaming Telemetry) to solve operational challenges.
  • History of debugging and optimizing code and automating routine tasks.
  • Experience with Mellanox/Cumulus Linux, Palo Alto firewalls, NetScaler, and F5 load balancers.
  • Previous SRE experience.

Benefits & Additional Information

  • Base salary range: 168,000 USD - 264,500 USD (determined by location, experience, and pay of employees in similar positions).
  • Eligible for equity and benefits.
  • NVIDIA is an equal opportunity employer committed to diversity.
  • Applications accepted at least until July 29, 2025.