Senior Network Reliability Engineer - DGX Cloud

at NVIDIA
USD 136,000-264,500 per year
SENIOR
✅ On-site

Required Skills & Competences

  • System Administration @ 4
  • Grafana @ 3
  • Linux @ 4
  • Prometheus @ 3
  • Python @ 4
  • GCP @ 4
  • AWS @ 4
  • Azure @ 4
  • Communication @ 4
  • Networking @ 3
  • AI @ 4
  • InfiniBand @ 4
  • Change Management @ 4

Details

NVIDIA is looking for a Senior Network Reliability Engineer to support and maintain the cloud and datacenter network infrastructure that serves the entire NVIDIA software stack, from graphics drivers to autonomous vehicles and AI. The role focuses on remediating critical alerts, triaging production-impacting network incidents, engaging vendors on hardware and software issues, and contributing to project work such as device upgrades and capacity augmentations.

Responsibilities

  • Remediate critical alerts within defined SLAs and triage production-impacting network incidents.
  • Participate in 24/7 global shift rotations to provide remote support for network repairs and changes; collaborate across teams and update customers on status and tickets.
  • Engage with external vendors to remediate hardware and software issues.
  • Drive improvements in change management and daily operations by following and refining established procedures.
  • Manage and operate large-scale IP network technologies and infrastructures (on-premises and cloud).
  • Work with peering and datacenter interconnect technologies: PNI, transit, exchange, passive DWDM, and wave circuits.
  • Monitor and support network health across on-premises and cloud infrastructures.
  • Collaborate on workflow enhancements and document best practices.

Requirements

  • Deep knowledge of and experience with TCP/IP, BGP, OSPF, MPLS, IS-IS, VXLAN, EVPN, QoS, GRE, IPsec, DNS, and MACsec.
  • 5+ years of experience in network operations.
  • Strong network troubleshooting skills and creative problem-solving abilities.
  • Proven track record of alert response within defined SLAs and incident management experience.
  • Experience with one or more cloud service provider environments: AWS, Azure, GCP, OCI.
  • Familiarity with Arista, Fortinet, and Juniper networking equipment.
  • Hands-on experience contributing to tooling and automation for provisioning, monitoring, and managing complex network infrastructures.
  • Bachelor’s degree in Computer Science, a related technical field, or equivalent experience.
  • Excellent verbal and written communication skills.

Ways To Stand Out From The Crowd

  • Solid understanding of Mellanox/Cumulus OS and InfiniBand technology.
  • Unix/Linux system administration skills and the ability to read and write Python and shell scripts to improve efficiency in hyperscale environments.
  • Familiarity with NetBox/Nautobot, Prometheus, Grafana, and Panoptes for monitoring and managing a global network.

Compensation

  • Base salary range: $136,000 - $224,250 USD (Level 3).
  • Base salary range: $168,000 - $264,500 USD (Level 4).
  • Eligible for equity and benefits.

Other information

  • Applications accepted at least until May 17, 2026.
  • NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.