Datacenter Resiliency Architect - New College Grad 2025
Required Skills & Competences
Python @ 3, Machine Learning @ 3, Networking @ 2, Debugging @ 6, CUDA @ 3, GPU @ 3
Details
NVIDIA is seeking a Resiliency Architect to support the development and validation of GPU hardware and software resiliency features. You will contribute to the design and verification of Datacenter GPUs and SoCs used for AI and high-performance computing. The role involves architecting RAS (Reliability, Availability, Serviceability) features, modeling and analyzing RAS metrics, collaborating with architects and engineers, developing verification testplans, implementing diagnostics and fault models, and running simulations across architecture models, RTL, emulation, and silicon.
Responsibilities
- Architect hardware and software resiliency features to improve system Reliability, Availability, Serviceability (RAS), and performance in the datacenter.
- Model and analyze RAS metrics (e.g., Failures in Time for permanent and transient errors) and availability at every scale from GPU to rack to datacenter; use the models to identify gaps and drive RAS improvements.
- Collaborate with architects, unit designers, and software engineers to ensure alignment of verification requirements.
- Develop and implement comprehensive architecture verification test plans for resiliency features.
- Execute architecture test plans by developing test content; enable, run, and debug tests on architecture models; support test debug on RTL, emulation, and silicon.
- Run simulations to analyze Architectural Vulnerability Factor and liveness of on-die memory, flip-flops, and latches.
- Develop CUDA software diagnostic kernels to run on clusters of NVIDIA GPUs and identify potential hardware issues.
- Develop and automate fault models to simulate various fault types (e.g., transient faults, stuck-at faults) in gate-level netlist, RTL, architectural model, silicon, and other environments.
Requirements
- Pursuing or recently completed a Master's or PhD in Computer Engineering, Electrical Engineering, or a closely related field, or equivalent experience.
- Familiarity with GPU and networking architectures and computer architecture basics (including caches, coherence, buses, DMA, etc.).
- Knowledge of machine learning / deep learning concepts.
- Proficiency in RAS concepts and in developing architecture models.
- Scripting and automation experience with Python or similar.
- Proficiency in C/C++.
- Excellent interpersonal skills and ability to collaborate with on-site and remote teams.
- Strong debugging and analytical skills; self-driven and results-oriented.
Preferred / Ways to Stand Out
- Experience with resiliency and datacenter RAS.
- Proficiency in Verilog/SystemVerilog RTL simulations and debug; ability to set up test benches and integrate components.
- Programming experience with CUDA.
Compensation & Other Details
- Base salary ranges provided by NVIDIA for this role:
  - Level 2: 120,000 USD - 189,750 USD per year
  - Level 3: 148,000 USD - 235,750 USD per year
- You will also be eligible for equity and benefits.
- Applications for this job are accepted at least until July 29, 2025.
About the Team
Join the Accelerated and Resilient Compute Systems team to help build resilient, highly available, cost-effective computing platforms for AI and HPC workloads. Work involves close collaboration across architecture, software, verification, and silicon teams.