Speaker
Description
Recent advancements in Artificial Intelligence (AI) and AI hardware accelerators have paved the way for on-edge AI processing, with benefits such as reduced data bandwidth and increased power efficiency. Applications in harsh radiation environments could also benefit from these improvements. However, due to the complex nature of both the accelerators and the AI models running on them, the effect of Single Event Upset (SEU) induced faults is not well understood. This research aims to perform an in-depth analysis of SEU-induced faults on an AI accelerator under different workloads, in order to obtain the information needed for an efficient fault mitigation strategy.
Summary (500 words)
In this research, we present an AI accelerator specifically tailored to Deep Neural Networks (DNNs). This full System on Chip (SoC) contains a data processing pipeline, control logic, data and instruction memories, and a communication module for interfacing with external components (highlighted in figure 1). As in most AI accelerators, the largest part of its hardware consists of memories and the data pipeline. Many studies have shown Error Correcting Codes (ECCs) to be effective for protecting memories against SEUs. Hence, this work focuses on analysing faults in the data pipeline whilst protecting the smaller control blocks using Triple Modular Redundancy (TMR).
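TMR protects the control blocks by keeping three redundant copies of each state element and taking a majority vote, so that a single upset in one copy is masked by the other two. The following minimal Python sketch illustrates the bitwise majority-vote principle only; the actual protection is implemented in hardware.

def tmr_vote(a: int, b: int, c: int) -> int:
    # Bitwise majority vote over three redundant copies of a register value.
    # A single SEU corrupts at most one copy, so the other two copies
    # recover the original value for every bit position.
    return (a & b) | (a & c) | (b & c)

# Example: a single-bit upset in one copy is masked by the other two.
golden = 0b10110010
upset = golden ^ (1 << 5)  # SEU flips bit 5 in one copy
assert tmr_vote(golden, upset, golden) == golden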
To analyse the effect of SEU-induced bit faults on a complex DNN workload, we constructed a fault injection simulation framework (highlighted in figure 2). This framework injects SEUs at the output nodes of flipflops by inverting their logical value. For each DNN inference (the propagation of one input image through the DNN model), the framework selects a random clock cycle and a random flipflop; when the simulation reaches the selected clock cycle, it inverts the bit in this flipflop, as sketched below. These experiments were performed for three DNN models and two datasets, namely (1) a 3-layer fully-connected (3L-FC) model with the MNIST dataset, (2) the LeNET model with MNIST and (3) a modified LeNET model with CIFAR10. For each of these networks, at least 500,000 faults were injected over a set of at least 2,000 different input images.
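The per-inference injection loop can be summarised by the following Python sketch. The simulator interface (load_input, read, force, step, read_output) is a hypothetical stand-in for the RTL simulation environment used in the actual framework.

import random

def run_inference_with_seu(simulator, image, flipflops, total_cycles):
    # One fault-injection run: a single random bit flip per DNN inference.
    target_ff = random.choice(flipflops)            # random flipflop
    target_cycle = random.randrange(total_cycles)   # random clock cycle

    simulator.load_input(image)
    for cycle in range(total_cycles):
        if cycle == target_cycle:
            # Invert the logical value at the flipflop output (SEU model)
            simulator.force(target_ff, not simulator.read(target_ff))
        simulator.step()                            # advance one clock cycle

    return simulator.read_output()                  # faulty DNN output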
For each injected fault, we compare the DNN model output to a golden reference output, which yields one of three outcomes: (1) no difference, (2) a numerical difference but no classification misprediction, or (3) a DNN model misclassification. Our experimental results show that an injected fault leads to a difference in the DNN model output with a probability of up to 27%. However, this probability drops to at most 3% when considering only misclassifications (Pclass = 3%). When observing the DNN models separately, we found that, for the MNIST models, LeNET has a significantly lower Pclass: 0.3% for LeNET-MNIST versus 2.5% for 3L-FC-MNIST. For LeNET-CIFAR10, Pclass increases to 3%. This highlights the large impact of both the DNN model and the dataset on the overall sensitivity of a DNN accelerator SoC to SEUs. Finally, we also observed a significant difference in fault probabilities between flipflop groups. The flipflops in the 8-bit matrix multiplication registers and the 8-bit rounding and non-linear function registers (see figure 1) have a maximum Pclass of 0.05% and thus contribute negligibly to the overall fault rate. In contrast, the 32-bit matrix multiplication registers and the accumulator have a much higher maximum Pclass (1% and 3% respectively), highlighting the importance of both the dynamic range of registers and the reuse of values in the accumulator. These results clearly show that mitigating faults only in the most sensitive registers (the 32-bit registers and the accumulator) already massively improves the accuracy of DNN inference, without requiring a mitigation strategy that covers the entire data pipeline, resulting in a much more efficient design.
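The three outcome categories and the derived Pclass metric can be expressed with the following illustrative Python helpers; the function names are assumptions for clarity, not part of the framework itself.

import numpy as np

def classify_outcome(faulty_logits, golden_logits):
    # Compare a faulty inference against the golden reference output.
    if np.array_equal(faulty_logits, golden_logits):
        return "no_difference"                  # outcome (1)
    if np.argmax(faulty_logits) == np.argmax(golden_logits):
        return "numerical_difference"           # outcome (2)
    return "misclassification"                  # outcome (3)

def p_class(outcomes):
    # Fraction of injected faults that cause a misclassification (Pclass).
    return sum(o == "misclassification" for o in outcomes) / len(outcomes)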