Speaker
Description
Recent advancements in Artificial Intelligence (AI) and AI hardware accelerators have paved the way for on-edge AI processing, with benefits such as reduced data bandwidth and increased power efficiency. Applications in harsh radiation environments could also benefit from these improvements. However, due to the complex nature of both the accelerators and the AI models running on them, the effect of Single Event Upset (SEU) induced faults is not well understood. This research aims to perform an in-depth analysis of SEU-induced faults on an AI accelerator under different workloads, in order to obtain the information needed for an efficient fault mitigation strategy.
Summary (500 words)
In this research, we present an AI accelerator specifically tailored to Deep Neural Networks (DNNs). This full System on Chip (SoC) contains a data processing pipeline, control logic, data and instruction memories, and a communication module for interfacing with external components (highlighted in figure 1). As in most AI accelerators, the largest part of its hardware consists of memories and the data pipeline. Many studies have shown Error Correcting Codes (ECCs) to be effective for protecting memories against SEUs. Hence, this work focuses on analysing faults in the data pipeline whilst protecting the smaller control blocks using Triple Modular Redundancy (TMR).
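TMR protects the control blocks by keeping three redundant copies of each state element and taking a majority vote, so that a single upset in one copy is masked by the other two. The following minimal Python sketch illustrates the bitwise majority-vote principle only; the actual protection is implemented in hardware.

def tmr_vote(a: int, b: int, c: int) -> int:
    # Bitwise majority vote over three redundant copies of a register value.
    # A single SEU corrupts at most one copy, so the other two copies
    # recover the original value for every bit position.
    return (a & b) | (a & c) | (b & c)

# Example: a single-bit upset in one copy is masked by the other two.
golden = 0b10110010
upset = golden ^ (1 << 5)  # SEU flips bit 5 in one copy
assert tmr_vote(golden, upset, golden) == golden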
To analyse the effect of SEU-induced bit faults on a complex DNN workload, we constructed a fault injection simulation framework (highlighted in figure 2). This framework injects SEUs at the output nodes of flipflops by inverting their logical value. For each DNN inference (the propagation of one input image through the DNN model), the framework selects a random clock cycle and a random flipflop; when the simulation reaches the selected clock cycle, it inverts the bit in this flipflop, as sketched below. These experiments were performed for three DNN models and two datasets, namely (1) a 3-layer fully-connected (3L-FC) model with the MNIST dataset, (2) the LeNET model with MNIST and (3) a modified LeNET model with CIFAR10. For each of these networks, at least 500,000 faults were injected over a set of at least 2,000 different input images.
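The per-inference injection loop can be summarised by the following Python sketch. The simulator interface (load_input, read, force, step, read_output) is a hypothetical stand-in for the RTL simulation environment used in the actual framework.

import random

def run_inference_with_seu(simulator, image, flipflops, total_cycles):
    # One fault-injection run: a single random bit flip per DNN inference.
    target_ff = random.choice(flipflops)            # random flipflop
    target_cycle = random.randrange(total_cycles)   # random clock cycle

    simulator.load_input(image)
    for cycle in range(total_cycles):
        if cycle == target_cycle:
            # Invert the logical value at the flipflop output (SEU model)
            simulator.force(target_ff, not simulator.read(target_ff))
        simulator.step()                            # advance one clock cycle

    return simulator.read_output()                  # faulty DNN output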
For each injected fault, we compare the DNN model output to a golden reference output, which yields one of three outcomes: (1) no difference, (2) a numerical difference but no classification misprediction, or (3) a DNN model misclassification. Our experimental results show that an injected fault leads to a difference in the DNN model output with a probability of up to 27%. However, this probability drops to at most 3% when considering only misclassifications (Pclass = 3%). When observing the DNN models separately, we found that, for the MNIST models, LeNET has a significantly lower Pclass: 0.3% for LeNET-MNIST versus 2.5% for 3L-FC-MNIST. For LeNET-CIFAR10, Pclass increases to 3%. This highlights the large impact of both the DNN model and the dataset on the overall sensitivity of a DNN accelerator SoC to SEUs. Finally, we also observed a significant difference in fault probabilities between flipflop groups. The flipflops in the 8-bit matrix multiplication registers and the 8-bit rounding and non-linear function registers (see figure 1) have a maximum Pclass of 0.05% and thus contribute negligibly to the overall fault rate. In contrast, the 32-bit matrix multiplication registers and the accumulator have a much higher maximum Pclass (1% and 3% respectively), highlighting the importance of both the dynamic range of registers and the reuse of values in the accumulator. These results clearly show that mitigating faults only in the most sensitive registers (the 32-bit registers and the accumulator) already massively improves the accuracy of DNN inference, without requiring a mitigation strategy that covers the entire data pipeline, resulting in a much more efficient design.
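The three outcome categories and the derived Pclass metric can be expressed with the following illustrative Python helpers; the function names are assumptions for clarity, not part of the framework itself.

import numpy as np

def classify_outcome(faulty_logits, golden_logits):
    # Compare a faulty inference against the golden reference output.
    if np.array_equal(faulty_logits, golden_logits):
        return "no_difference"                  # outcome (1)
    if np.argmax(faulty_logits) == np.argmax(golden_logits):
        return "numerical_difference"           # outcome (2)
    return "misclassification"                  # outcome (3)

def p_class(outcomes):
    # Fraction of injected faults that cause a misclassification (Pclass).
    return sum(o == "misclassification" for o in outcomes) / len(outcomes)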