Datacenter Resiliency Architect - New College Grad 2025

at Nvidia
USD 120,000-235,800 per year
MIDDLE
✅ On-site

Required Skills & Competences

Python @ 3, Machine Learning @ 3, Networking @ 2, Debugging @ 6, CUDA @ 3, GPU @ 3

Details

Today, NVIDIA is tapping into the unlimited potential of AI to define the next era of computing, an era in which our GPUs act as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, encouraging environment where everyone is inspired to do their best work. Come join the team and see how we can make a lasting impact on the world.

Responsibilities

  • Architect hardware and software Resiliency features to improve system Reliability, Availability, Serviceability (RAS), and performance in the Datacenter.
  • Model and analyze RAS metrics like Failures in Time for permanent and transient errors, and Availability from GPU to Rack to Datacenter. Use models to identify gaps and drive RAS improvements.
  • Collaborate with architects, unit designers and software engineers to ensure alignment of verification requirements.
  • Develop and implement comprehensive architecture verification test plans for resiliency features.
  • Execute the architecture test plan by developing test content and working with Software and Architecture teams to enable, run, and debug tests on architecture models. Support test debug on RTL, emulation, and silicon.
  • Run simulations to analyze Architectural Vulnerability Factor and Liveness of on-die memory, flip-flops, and latches.
  • Develop CUDA software diagnostics kernels to run on clusters of NVIDIA GPUs and identify potential hardware issues.
  • Develop and automate fault models to simulate various fault types (e.g., transient faults, stuck-at faults) in gate-level netlist, RTL, architectural model, silicon and other environments.

Requirements

  • Pursuing or recently completed a Master’s or PhD degree in Computer Engineering, Electrical Engineering, or a closely related field, or equivalent experience.
  • Familiarity with GPU and networking architectures, computer architecture basics (including caches, coherence, buses, and direct memory access), and Machine Learning/Deep Learning concepts.
  • Proficiency in RAS concepts and in developing Architecture models.
  • Scripting and automation with Python or similar.
  • Proficiency in C/C++.
  • Excellent interpersonal skills and ability to collaborate with on-site and remote teams.
  • Strong debugging and analytical skills.
  • Self-driven and results-oriented.

Ways to stand out from the crowd

  • Experience with resiliency and datacenter RAS.
  • Proficiency in Verilog/SystemVerilog RTL simulation and debug. Ability to set up test benches and integrate various components.
  • Experience programming with CUDA.

Benefits

You will be eligible for equity and benefits. NVIDIA fosters a diverse work environment and is an equal opportunity employer committed to diversity and inclusion.