3–6 Oct 2022
Southern Methodist University
America/Chicago timezone

FKeras: A Fault Tolerance Library for DNNs

3 Oct 2022, 15:00
15m
Southern Methodist University

Speaker

Olivia Weng

Description

Many scientific applications require neural networks (NNs) to operate correctly in safety-critical or high-radiation environments, including automated driving, space, and high energy physics. For example, physicists at the Large Hadron Collider (LHC) seek to deploy an autoencoder in a high-radiation environment to filter their experimental data, which is collected at a high data rate (~40 TB/s). This is challenging because the autoencoder must run efficiently, within 200 ns and in a resource-constrained setting, to process all the data, and it must also run correctly amid high radiation. As such, the autoencoder’s hardware must be both efficient and robust.

However, efficiency and robustness are often in conflict. Robust hardware methods like triple modular redundancy (TMR) protect against faults at the cost of 200% more resources, which reduces efficiency [1]. To address these opposing demands, we must understand the fault tolerance inherent to NNs. NNs have many redundant parameters, which suggests that blanket hardware redundancy, the common practice, is unnecessary when redundancy is already present in the software. To identify where this redundancy exists in an NN, we present FKeras, an open-source tool that measures the fault tolerance of NNs at the bit level. Once we identify which parts of the NN are insensitive to radiation faults, we need not protect them, reducing the resources spent on robust hardware.
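To make the 200% overhead concrete, here is a minimal sketch of the voting step behind triple modular redundancy. It is illustrative only and is not taken from FKeras or from [1]; the names `majority_vote` and `flip_bit` are hypothetical. Three copies of a word are kept (two extra copies, hence 200% more storage), and a bitwise majority masks a fault in any single copy.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of the same word."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a single-bit upset by inverting one bit of the word."""
    return word ^ (1 << bit)


golden = 0b0101_1010              # the fault-free value
faulty = flip_bit(golden, 6)      # one of the three copies is corrupted
voted = majority_vote(golden, faulty, golden)

print(f"golden={golden:08b} faulty={faulty:08b} voted={voted:08b}")
```

The voter recovers the golden value whenever at most one copy is corrupted at a given bit position, regardless of whether that bit matters to the network's output. That indiscriminate protection is the cost the abstract argues can be reduced.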

FKeras takes a fine-grained, bottom-up approach to evaluating the fault tolerance of NNs at the bit level. The user can evaluate both floating-point and quantized NNs, for which previous work had little support. Since FKeras builds on top of QKeras, a quantized NN library, users can easily adjust quantization settings as well as fault injection settings (such as bit error rate, bit error location, and transient versus permanent faults) during training and/or inference. Prior work [1-3] introduced tools to evaluate NN robustness; however, these tools are too coarse-grained or are closed source, precluding researchers from fully understanding the robustness of NNs. They also have limited quantization support. FKeras is open source, allowing researchers to easily evaluate quantized NNs at the bit level. Having a bit-level understanding is paramount when every bit counts, especially in resource-constrained settings at the extreme edge. FKeras is a first step towards providing an open-source way of identifying which bits must be protected and which need not be.
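The idea of bit-level fault injection can be sketched in plain Python. This is not the FKeras API; every name below (`quantize`, `dequantize`, `flip_bit`, `sensitivity`) and the 8-bit format with 6 fractional bits are assumptions chosen for illustration. The sketch quantizes a few weights to signed fixed point, flips one bit at a time, and records how far the output of a single dot product moves.

```python
TOTAL_BITS = 8   # assumed word width of each quantized weight
FRAC_BITS = 6    # assumed number of fractional bits


def quantize(x: float) -> int:
    """Round to signed fixed point; return the raw two's-complement bit pattern."""
    lo, hi = -(1 << (TOTAL_BITS - 1)), (1 << (TOTAL_BITS - 1)) - 1
    q = max(lo, min(hi, round(x * (1 << FRAC_BITS))))
    return q & ((1 << TOTAL_BITS) - 1)


def dequantize(bits: int) -> float:
    """Interpret a raw bit pattern as a signed fixed-point value."""
    if bits >= 1 << (TOTAL_BITS - 1):
        bits -= 1 << TOTAL_BITS
    return bits / (1 << FRAC_BITS)


def flip_bit(bits: int, position: int) -> int:
    """Inject a single bit flip at the given position (0 = least significant)."""
    return bits ^ (1 << position)


def dot(weight_bits, inputs) -> float:
    """Output of a one-neuron 'layer' computed from quantized weights."""
    return sum(dequantize(w) * x for w, x in zip(weight_bits, inputs))


weights = [quantize(w) for w in (0.5, -0.25, 0.75)]
inputs = (1.0, 2.0, -1.0)
baseline = dot(weights, inputs)


def sensitivity(weight_index: int, bit: int) -> float:
    """Absolute output change caused by flipping one bit of one weight."""
    faulty = list(weights)
    faulty[weight_index] = flip_bit(faulty[weight_index], bit)
    return abs(dot(faulty, inputs) - baseline)


# Exhaustively score every (weight, bit) pair and rank by impact.
sensitivities = {
    (i, b): sensitivity(i, b)
    for i in range(len(weights))
    for b in range(TOTAL_BITS)
}
most_sensitive = max(sensitivities, key=sensitivities.get)
print("most sensitive (weight, bit):", most_sensitive)
```

In this toy setting, flips in low-order bits barely move the output while sign-bit flips dominate, which is the kind of per-bit ranking that would let a designer protect only the bits that matter. A real evaluation would measure a task-level metric over a dataset rather than one dot product.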

We would like to extend FKeras to statically identify which bits are insensitive to faults, saving time by avoiding simulation. At the workshop, we look forward to discussing and better understanding the fault tolerance needs of science. We will keep these needs in mind as we continue to build FKeras, with the goal of better supporting the scientific community.

[1] Bertoa et al. "Fault Tolerant Neural Network Accelerators with Selective TMR." IEEE D&T’22.
[2] Chen et al. "TensorFI: A Flexible Fault Injection Framework for TensorFlow Applications." ISSRE’20.
[3] Mahmoud et al. "PyTorchFI: A Runtime Perturbation Tool for DNNs." DSN-W’20.

Primary author

Co-authors

Andres Meza (UC San Diego)
Benjamin Hawks (Fermi National Accelerator Lab)
Quinlan Bock (Fermilab National Accelerator Laboratory)
Javier Mauricio Duarte (Univ. of California San Diego (US))
Nhan Tran (Fermi National Accelerator Lab. (US))
Ryan Kastner
