Senior Datacenter Resiliency Architect
at NVIDIA
Santa Clara, United States
USD 184,000-356,500 per year
Required Skills & Competences
Python @ 4, Machine Learning @ 3, Networking @ 3, Debugging @ 7, CUDA @ 4, GPU @ 4
Details
NVIDIA is seeking a Resiliency Architect to support the development and validation of GPU hardware and software resiliency features. This role impacts Datacenter GPUs and SOCs powering AI and HPC product lines.
Responsibilities
- Architect hardware and software resiliency features to improve system Reliability, Availability, and Serviceability (RAS) as well as performance in the datacenter.
- Model and analyze RAS metrics such as Failures in Time (FIT) for permanent and transient errors, and availability at every level from GPU to rack to datacenter.
- Collaborate with architects, unit designers, and software engineers to ensure alignment of verification requirements.
- Develop and implement comprehensive architecture verification test plans for resiliency features.
- Execute test plans by developing test content, working with Software and Architecture teams to enable, run, and debug tests on architectural models; support test debug on RTL, emulation, and silicon.
- Run simulations to analyze the Architectural Vulnerability Factor (AVF) and liveness of on-die memory, flip-flops, and latches.
- Develop CUDA software diagnostic kernels to run on GPU clusters to identify hardware issues.
- Develop and automate fault models that simulate transient and stuck-at faults in gate-level netlists, RTL, architectural models, silicon, and other environments.
Requirements
- Master's or PhD degree in Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.
- 5+ years of relevant experience.
- Familiarity with GPU and Networking architectures, computer architecture basics (caches, coherence, buses, DMA), and machine learning/deep learning concepts.
- Strong knowledge/expertise in GPU hardware architecture or RAS features.
- Proficiency in architecture model development.
- Scripting and automation experience with Python or similar.
- Proficiency in C/C++.
- Excellent interpersonal and collaboration skills.
- Strong debugging and analytical capabilities.
- Self-driven and results-oriented.
Preferred Qualifications
- Experience with resiliency and datacenter RAS.
- Proficiency in Verilog/SystemVerilog RTL simulation and debug; ability to set up testbenches.
- Programming experience with CUDA.
About NVIDIA
NVIDIA pioneered the GPU in 1999, revolutionizing graphics and parallel computing, and now drives AI advancements as the AI computing company. This position is with their Accelerated and Resilient Compute Systems team.
Salary
The base salary range is $184,000 - $356,500 USD annually, depending on location, experience, and peer benchmarking. Equity and benefits are also provided.
NVIDIA is an equal opportunity employer committed to diversity and inclusion.