Fast Machine Learning for Science Workshop 2022

Name: Fast Machine Learning for Science Workshop 2022
Start: 2022-10-03T09:00:00-05:00
End: 2022-10-06T12:30:00-05:00
Location: Southern Methodist University

America/Chicago timezone

FKeras: A Fault Tolerance Library for DNNs

3 Oct 2022, 15:00

15m

Contributed Talks

Speaker: Olivia Weng

Many scientific applications require neural networks (NNs) to operate correctly in safety-critical or high-radiation environments, including automated driving, space, and high energy physics. For example, physicists at the Large Hadron Collider (LHC) seek to deploy an autoencoder in a high-radiation environment to filter their experimental data, which is collected at a high data rate (~40 TB/s). This is challenging on two fronts: to process all the data, the autoencoder must run efficiently, within 200 ns and in a resource-constrained setting, and it must also produce correct results amid high radiation. As such, the autoencoder’s hardware must be both efficient and robust.

However, efficiency and robustness are often in conflict. Robust hardware methods such as triple modular redundancy (TMR) protect against faults by triplicating logic, which increases resource usage by 200% and in turn reduces efficiency [1]. To address these opposing demands, we must understand the fault tolerance inherent to NNs. NNs have many redundant parameters, which suggests that we do not need to introduce blanket redundancy in the hardware (the common practice) when redundancy is already present in the software. To identify where this redundancy exists in an NN, we present FKeras, an open-source tool that measures the fault tolerance of NNs at the bit level. Once we identify which parts of the NN are insensitive to radiation faults, we need not protect them, which reduces the resources spent on robust hardware.
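As a rough illustration of where the 200% figure comes from, the sketch below models TMR on a single stored word: three copies are kept and a bitwise majority vote masks a fault in any one of them. The function name `tmr_vote` and the example values are hypothetical and are not part of FKeras or of the hardware described in [1].

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


# Hypothetical 8-bit value stored three times.
golden = 0b0101_1010

# A single-event upset flips bit 6 in one of the three copies.
faulty = golden ^ (1 << 6)

# The two intact copies outvote the corrupted one.
voted = tmr_vote(golden, faulty, golden)

# Three copies instead of one: two extra copies, i.e. a 200% increase
# in storage (the voter logic itself is ignored here).
copies = 3
overhead_pct = (copies - 1) * 100.0
```

The vote only masks faults confined to one copy; if the same bit is corrupted in two copies, the majority is wrong, which is one reason blanket TMR is costly relative to the protection it buys.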

FKeras takes a fine-grained, bottom-up approach to evaluating the fault tolerance of NNs at the bit level. The user can evaluate both floating-point and quantized NNs; previous work offered little support for the latter. Because FKeras builds on QKeras, a quantized NN library, users can easily adjust quantization settings as well as fault injection settings (such as bit error rate, bit error location, and transient versus permanent faults) during training, inference, or both. Prior work [1-3] introduced tools to evaluate NN robustness; however, these tools are either too coarse-grained or closed source, which prevents researchers from fully understanding the robustness of NNs. They also have limited quantization support. FKeras is open source, allowing researchers to easily evaluate quantized NNs at the bit level. A bit-level understanding is paramount when every bit counts, especially in resource-constrained settings at the extreme edge. FKeras is a first step towards providing an open-source way of identifying which bits must be protected and which need not be.
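To make bit-level fault injection concrete, here is a minimal NumPy sketch. It does not use the FKeras or QKeras APIs; the fixed-point format (8 bits total, 6 fractional), the helper names `quantize`, `flip_bit`, and `inject_random_faults`, and the toy single-neuron model are all assumptions chosen for illustration. It shows a targeted flip at a chosen bit location and random flips at a given bit error rate.

```python
import numpy as np

TOTAL_BITS = 8   # assumed word width (illustrative)
FRAC_BITS = 6    # assumed number of fractional bits (illustrative)
SCALE = 1 << FRAC_BITS
MASK = (1 << TOTAL_BITS) - 1
SIGN = 1 << (TOTAL_BITS - 1)


def quantize(weights):
    """Round floats to signed fixed-point integers, saturating at the range ends."""
    q = np.round(np.asarray(weights) * SCALE)
    return np.clip(q, -SIGN, SIGN - 1).astype(np.int64)


def to_signed(words):
    """Reinterpret unsigned TOTAL_BITS-wide words as two's complement integers."""
    return np.where(words >= SIGN, words - (1 << TOTAL_BITS), words)


def flip_bit(q, index, bit):
    """Return a copy of q with a single bit of a single weight flipped."""
    out = q.copy()
    word = (int(out[index]) & MASK) ^ (1 << bit)
    out[index] = word - (1 << TOTAL_BITS) if word >= SIGN else word
    return out


def inject_random_faults(q, bit_error_rate, rng):
    """Flip every stored bit independently with probability bit_error_rate."""
    flips = rng.random((q.size, TOTAL_BITS)) < bit_error_rate
    flip_words = (flips * (1 << np.arange(TOTAL_BITS))).sum(axis=1)
    words = (q.ravel() & MASK) ^ flip_words
    return to_signed(words).reshape(q.shape)


# Toy "network": one linear neuron with four quantized weights.
q = quantize([0.5, -0.25, 0.75, 0.125])
x = np.ones(4)
baseline = (q / SCALE) @ x


def output_error(q_faulty):
    """Absolute change in the neuron output caused by the injected faults."""
    return abs((q_faulty / SCALE) @ x - baseline)


lsb_error = output_error(flip_bit(q, index=0, bit=0))               # least significant bit
msb_error = output_error(flip_bit(q, index=0, bit=TOTAL_BITS - 1))  # sign bit

rng = np.random.default_rng(0)
noisy = inject_random_faults(q, bit_error_rate=0.05, rng=rng)
```

In this toy setting a flip in the sign bit changes the output far more than a flip in the least significant bit, which is the kind of per-bit difference that motivates protecting only the sensitive bits.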

We would like to extend FKeras to statically identify which bits are insensitive to faults, avoiding simulation and thereby saving time. At the workshop, we look forward to discussing and better understanding the fault tolerance needs of science. We will keep these needs in mind as we continue to build FKeras, with the goal of better supporting the scientific community.

[1] Bertoa et al. "Fault Tolerant Neural Network Accelerators with Selective TMR." IEEE D&T’22.
[2] Chen et al. "TensorFI: A Flexible Fault Injection Framework for TensorFlow Applications." ISSRE’20.
[3] Mahmoud et al. "PyTorchFI: A Runtime Perturbation Tool for DNNs." DSN-W’20.

Primary author: Olivia Weng

Co-authors: Andres Meza (UC San Diego); Benjamin Hawks (Fermi National Accelerator Lab); Quinlan Bock (Fermi National Accelerator Laboratory); Javier Mauricio Duarte (Univ. of California San Diego (US)); Nhan Tran (Fermi National Accelerator Lab. (US)); Ryan Kastner

Presentation materials: FKeras FastML'22.pdf
