Enhancing Fault Tolerance in High-Performance Computing: A real hardware case study on a RISC-V Vector Processing Unit

Marcello Barbirotta; Francesco Minervini; Carlos Rojas Morales; Adrian Cristal; Osman Unsal; Mauro Olivieri

doi:10.36227/techrxiv.171177404.45560143/v1

Abstract

High-Performance Computing (HPC) systems are designed for large-scale processing and complex dataset analysis, leveraging scalability, efficiency, and parallelism, and often integrate specialized hardware structures such as Vector Processing Units (VPUs). As these systems have grown in complexity and scale, their vulnerability to errors and failures has become an increasingly important issue. Our research addresses this challenge by exploring and implementing advanced fault tolerance techniques inside the Vitruvius+ architecture, a partial out-of-order Vector Processing Unit. To the best of our knowledge, this is the first full RTL implementation of instruction replication for reliability in an HPC-class vector processor. Specifically, we investigate the integration and interaction of redundancy mechanisms inside the most sensitive architectural units. An extensive fault injection simulation campaign shows a 75% reduction in non-silent faults causing system failure, with a hardware overhead of only 7.5% and a negligible variation in clock frequency.

26 Mar 2024: Submitted to TechRxiv

30 Mar 2024: Published in TechRxiv
