Senior Network Reliability Engineer - DGX Cloud

at NVIDIA

📍 Santa Clara, United States

USD 136,000-264,500 per year

SENIOR

✅ On-site

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

System Administration @ 4, Grafana @ 3, Linux @ 4, Prometheus @ 3, Python @ 4, GCP @ 4, AWS @ 4, Azure @ 4, Communication @ 4, Networking @ 3, AI @ 4, InfiniBand @ 4, Change Management @ 4

Details

NVIDIA is looking for a Senior Network Reliability Engineer to support and maintain the cloud and datacenter network infrastructure that serves the entire NVIDIA software stack, from graphics drivers to autonomous vehicles and AI. The role focuses on remediating critical alerts, triaging production-impacting network incidents, engaging vendors on hardware and software issues, and contributing to project work such as device upgrades and capacity augmentations.

Responsibilities

Remediate critical alerts within defined SLAs and triage production-impacting network incidents.
Participate in 24/7 global shift rotations to provide remote support for network repairs and changes; collaborate across teams and update customers on status and tickets.
Engage with external vendors to remediate hardware and software issues.
Drive improvements in change management and daily operations by following established procedures and refining them.
Manage and operate large-scale IP network technologies and infrastructures (on-premises and cloud).
Work with peering and datacenter interconnect technologies: PNI, transit, exchange, passive DWDM, and wave circuits.
Monitor and support network health across on-premises and cloud infrastructures.
Collaborate on workflow enhancements and document best practices.

Requirements

Deep knowledge of and experience with TCP/IP, BGP, OSPF, MPLS, IS-IS, VXLAN, EVPN, QoS, GRE, IPsec, DNS, and MACsec.
5+ years of experience in network operations.
Strong network troubleshooting skills and creative problem-solving abilities.
Proven track record of alert response within defined SLAs and incident management experience.
Experience with one or more cloud service provider environments: AWS, Azure, GCP, OCI.
Familiarity with Arista, Fortinet, and Juniper networking equipment.
Hands-on experience contributing to tooling and automation for provisioning, monitoring, and managing complex network infrastructures.
Bachelor’s degree in Computer Science, a related technical field, or equivalent experience.
Excellent verbal and written communication skills.

Ways To Stand Out From The Crowd

Solid understanding of Mellanox/Cumulus OS and InfiniBand technology.
Unix/Linux system administration skills and ability to write/understand Python and shell scripts to improve efficiency in hyperscale environments.
Familiarity with NetBox/Nautobot, Prometheus, Grafana, and Panoptes for monitoring and managing a global network.

Compensation

Base salary range: $136,000 - $224,250 USD (Level 3).
Base salary range: $168,000 - $264,500 USD (Level 4).
Eligible for equity and benefits.

Other information

Applications accepted at least until May 17, 2026.
NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.