Network Site Reliability Engineer

at Nvidia
USD 168,000-264,500 per year
MIDDLE
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

Ansible @ 3, Cumulus Linux @ 3, Go @ 3, Grafana @ 2, Linux @ 3, Prometheus @ 2, Python @ 3, Communication @ 6, Networking @ 6, SRE @ 3, Jira @ 3, ServiceNow @ 3, Debugging @ 3, Salt @ 3

Details

NVIDIA is seeking a Network Site Reliability Engineer (SRE) to join the Enterprise Network Operations and SRE team. The role focuses on implementing a reliable and efficient network infrastructure, minimizing manual operational tasks, improving automation and observability, performing blameless postmortems and RCAs, and ensuring high user satisfaction through excellent network operations. The position emphasizes hands-on debugging, network automation, documentation, and operational excellence across enterprise and data center networks.

Responsibilities

  • Own the operational aspect of the network infrastructure to ensure high availability and reliability.
  • Actively work on network incidents and service requests, driving swift resolution and customer satisfaction.
  • Partner with architecture and deployment teams to ensure new implementations are supportable and meet production standards.
  • Advocate for and implement automation to reduce toil and improve operational efficiency.
  • Monitor network performance, identify areas for improvement, and collaborate with relevant teams to implement refinements.
  • Collaborate with domain experts across functions to resolve production issues and follow through on Root Cause Analyses (RCAs) and blameless postmortems.
  • Discover opportunities for operational improvements and develop solutions that enhance reliability and sustainability of network operations.

Requirements

  • BS degree in Computer Science, Electrical Engineering, or a related technical field, or equivalent experience.
  • Minimum of 10 years of industry experience in network operations or related fields, with a focus on automation and site reliability engineering. Familiarity with enterprise and data center networks is critical.
  • Strong fundamentals in networking and experience fixing complex network issues; expertise in technologies such as TCP/UDP, IPv4/IPv6, wireless, BGP, IS-IS, VPN, L2 switching, firewalls, load balancers, and data center network technologies.
  • Familiarity with monitoring and network management tools, including Prometheus, Grafana, Alertmanager, Nautobot/NetBox, and BigPanda.
  • Expertise in network automation frameworks such as Salt, Ansible, or similar.
  • Experience with process and service tooling such as ServiceNow and Jira, and foundational knowledge of ITIL.
  • Knowledge of Linux system fundamentals.
  • Strong problem-solving, critical-thinking, and communication skills, combined with ownership and drive.

Ways to stand out

  • Experience using operational signals (SNMP, Syslog, streaming telemetry) to drive automation and operational solutions.
  • Platform exposure such as Mellanox/Cumulus Linux, Palo Alto firewalls, NetScaler, and F5 load balancers.
  • Programming and scripting competence in Python, Go, or related languages; ability to build complex systems for network supervision and control beyond simple scripting.
  • Proficiency with advanced network technologies such as VXLAN/EVPN at scale, MPLS, RSVP, Segment Routing, SD-WAN, and SASE platforms.

Compensation & Benefits

  • Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 168,000 USD - 264,500 USD.
  • You will also be eligible for equity and benefits (see NVIDIA benefits page).

Other

  • Applications for this job will be accepted at least until September 27, 2025.
  • NVIDIA is an equal opportunity employer and fosters a diverse work environment.