Senior System Architect, Infrastructure Reliability

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

Kubernetes @ 4, Linux @ 4, Python @ 7, Distributed Systems @ 4, Machine Learning @ 4, Reporting @ 4, CUDA @ 4, GPU @ 4, AI @ 4, Slurm @ 4, HPC @ 4

Details

NVIDIA is seeking a Senior System Architect (Heterogeneous EDA Systems) to solve a complex challenge in accelerated computing: Failure Attribution at Scale. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. The role is to design and build an automated framework that ingests telemetry from CPU and GPU clusters to identify the root cause of job failures in real time, distinguishing between hardware faults, infrastructure instability, and software defects.

Responsibilities

  • Architect Failure Attribution Frameworks: build a scalable "flight recorder" for EDA jobs that captures high-fidelity state across CPU, GPU, and fabric at the moment of failure.
  • Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions, and connect these errors with system-level events such as OOM kills or NUMA-related hangs.
  • Distributed logging & tracing: implement low-overhead tracing mechanisms (using tracing tools or custom agents) that provide visibility into job execution across multi-node Slurm or Kubernetes clusters.
  • Root cause automation: develop heuristics and machine learning models to classify failures as "Hardware Fault," "Software Bug," or "Environment Issue" to reduce Mean Time to Identify (MTTI) for R&D teams.
  • Resiliency engineering: work with hardware and infrastructure teams to define "signals of impending failure," enabling proactive job migration or checkpointing before a crash.

Requirements

  • Education: BS, MS, or PhD in Computer Science or Electrical Engineering (or equivalent experience) with 6+ years in systems programming.
  • Distributed systems mastery and proven experience building automated RCA (root cause analysis) pipelines for HPC or cloud-scale environments.
  • CPU architecture expertise (x86/ARM) with deep knowledge of node-level metrics: IPC, cache contention, NUMA imbalance, and hardware interrupts.
  • Strong programming proficiency in C++ and Python; ability to build high-performance daemons that monitor system health without impacting workload performance.
  • Familiarity with cluster resource managers and job lifecycle/signal propagation: Slurm, LSF, or Kubernetes.

Ways to Stand Out

  • Low-level diagnostics: expert knowledge of the Linux kernel and error-reporting interfaces (/dev/mcelog, dmesg, journald) and how the kernel handles hardware exceptions and memory faults.
  • GPU infrastructure proficiency: deep experience with NVIDIA DCGM (Data Center GPU Manager) and NVIDIA Management Library (NVML) for monitoring device health and capturing state-dumps.
  • Experience with non-intrusive monitoring of application health and syscall-level failure patterns.
  • Experience with checkpoint/restore technologies like CRIU and their application in long-running EDA flows.

Compensation & Benefits

  • Base salary ranges by level: Level 4 — 184,000 USD to 287,500 USD; Level 5 — 224,000 USD to 356,500 USD.
  • Eligible for equity and benefits (link to NVIDIA benefits referenced in the posting).

Other

  • Office arrangement: hybrid.
  • Applications accepted at least until March 1, 2026.
  • NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.