

# PulseDL-II: A system-on-chip neural network accelerator for timing and energy extraction of nuclear detector signals

Pengcheng Ai (Speaker), Zhi Deng, Yi Wang, Hui Gong, Xinchi Ran, Zijian Lang

Department of Engineering Physics, Tsinghua University

8/1/2022

#### Three Elements of Design Perspectives

- Three independent elements:
- A. Nuclear Electronics: Readout system and signal features
- B. Neural Network: Architectural research, network training...
- C. Digital Design: NN accelerator and system-on-chip scheme
- Overlay of elements:
- **AB. Application Training**: NN algorithm research for nuclear signals, selection and optimization of network architecture
- **BC. Hardware Mapping**: Accelerator hardware implementation of NN
- AC. System Prototype: Hardware design in the context of readout system
- ABC. Joint Validation: Synthesis of the above three





# TABLE OF CONTENTS

- Signal feature extraction with NN
- NN accelerator-based readout system
- System-on-Chip Accelerator Design
- System Validation
- Summary



#### What and to What Extent Neural Nets Can Do (AB)





Estimation of heterogeneous uncertainty of nuclear detector signals with ensemble of NNs

P.Ai et al 2022 JINST 17 P02032
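One common way to turn an ensemble into an uncertainty estimate is to let each member predict a mean and a variance per event and to merge the members as a uniform Gaussian mixture. The sketch below illustrates that general idea only; it is not necessarily the procedure of the cited paper, and the arrays `mu` and `var` are hypothetical member outputs.

```python
import numpy as np

def combine_ensemble(mu, var):
    """Merge per-member Gaussian predictions into one mean and variance.

    mu, var: array-likes of shape (n_members, n_events).
    """
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    mean = mu.mean(axis=0)
    # Total variance = average predicted variance + spread of the member means
    total_var = (var + mu**2).mean(axis=0) - mean**2
    return mean, total_var

# Hypothetical outputs of three members for two events
mu = [[1.00, 2.0], [1.02, 2.0], [0.98, 2.0]]
var = [[0.01, 0.04], [0.01, 0.04], [0.01, 0.04]]
mean, total_var = combine_ensemble(mu, var)
```

When the members disagree (first event), the combined variance exceeds the variance any single member predicts; when they agree (second event), it reduces to the predicted variance.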




Computation of the Cramér-Rao lower bound of timing to find the limits for NN and traditional methods

P.Ai et al 2021 JINST 16 P09019
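For a known pulse shape sampled with independent Gaussian noise of standard deviation σ, the Cramér-Rao bound on the arrival time τ is var(τ̂) ≥ σ² / Σᵢ [∂s(tᵢ − τ)/∂τ]². The sketch below evaluates this textbook white-noise case numerically. The bi-exponential `pulse`, its time constants, and the 5 ns sampling grid are assumptions for illustration; the cited paper may treat more general noise.

```python
import numpy as np

def timing_crlb(pulse, t, tau, sigma, eps=1e-6):
    """Lower bound on the std. of any unbiased estimator of arrival time tau."""
    # Central-difference derivative of the sampled pulse with respect to tau
    ds = (pulse(t - (tau + eps)) - pulse(t - (tau - eps))) / (2 * eps)
    fisher_information = np.sum(ds**2) / sigma**2
    return 1.0 / np.sqrt(fisher_information)

def pulse(x, amp=1.0, rise=5.0, fall=20.0):
    """Hypothetical bi-exponential pulse (times in ns), zero before arrival."""
    x = np.maximum(x, 0.0)
    return amp * (np.exp(-x / fall) - np.exp(-x / rise))

t = np.arange(0.0, 160.0, 5.0)  # 32 samples, assumed 5 ns spacing
bound_lo = timing_crlb(pulse, t, tau=12.3, sigma=0.01)
bound_hi = timing_crlb(pulse, t, tau=12.3, sigma=0.02)
```

In this model the bound scales linearly with the noise level, so doubling `sigma` doubles the best achievable timing resolution.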



## Building Blocks of Network Structure (and Why They Work) (AB)





[Figure: building blocks. Panels: one-dimensional convolution, one-dimensional deconvolution, fully-connected matrix multiplication]

- We choose Convolutional Neural Networks (CNNs) because they have succeeded in many ML tasks and lend themselves to parallel computing
- We select four representative building blocks:
  - 1D convolution layer
  - 1D deconvolution layer
  - fully-connected layer
  - nonlinear activation (ReLU)
- Nonlinearity is the key to inductive learning (and thus intelligent signal processing)
  J.C.Ye (2022) Geometry of Deep Learning, Springer
  - Without nonlinearity, the weights in the mapping function are the same for any input sample. Once learned, they never change. (transductive)
  - With nonlinearity, the weights in the mapping function are selectively turned off or scaled by the nonlinear function, depending on the input. (inductive)
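The role of the nonlinearity can be made concrete with a two-layer ReLU network: for each input, the network acts as a linear map whose weights are gated by the activation pattern. The following is a minimal sketch with randomly chosen, hypothetical weights `W1` and `W2`; it is not taken from the PulseDL-II networks.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # hypothetical first-layer weights
W2 = rng.normal(size=(2, 8))   # hypothetical second-layer weights

def forward(x):
    """Two-layer network with ReLU between the layers."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def effective_weights(x):
    """Linear map that the ReLU network applies to this particular x."""
    gate = (W1 @ x > 0).astype(float)   # 1 where a hidden unit is active
    return W2 @ (gate[:, None] * W1)

x_a = rng.normal(size=4)
x_b = -x_a                              # flips every hidden-unit gate
same_output = np.allclose(effective_weights(x_a) @ x_a, forward(x_a))
same_weights = np.allclose(effective_weights(x_a), effective_weights(x_b))
```

Here `same_output` holds while `same_weights` does not: the effective weights reproduce the network output for a given input, but they change from one input to another. Removing the ReLU would make them identical for every input.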

#### Autoencoder-Based Network Architecture (AB)





- The regression network can be placed at the far end of the decoder if an accurate noiseless waveform can be obtained.
- The regression network can also be placed at the bottleneck if only the original waveform is available (the decoder is then optional).
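The second option (regression at the bottleneck, no decoder) can be sketched as a plain NumPy forward pass. All layer sizes, the 32-sample input length, and the random weights below are assumptions for illustration, not the architecture used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, stride=1):
    """Valid 1D convolution. x: (c_in, n), w: (c_out, c_in, k)."""
    c_out, c_in, k = w.shape
    n_out = (x.shape[1] - k) // stride + 1
    y = np.empty((c_out, n_out))
    for i in range(n_out):
        window = x[:, i * stride : i * stride + k]
        y[:, i] = np.tensordot(w, window, axes=([1, 2], [0, 1]))
    return y

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical encoder: 32 samples -> 15 -> 7 positions, then a dense regressor
w_enc1 = rng.normal(scale=0.1, size=(4, 1, 4))
w_enc2 = rng.normal(scale=0.1, size=(8, 4, 3))
w_reg = rng.normal(scale=0.1, size=(2, 8 * 7))   # outputs: (time, energy)

def encode_and_regress(waveform):
    h = relu(conv1d(waveform[None, :], w_enc1, stride=2))
    z = relu(conv1d(h, w_enc2, stride=2))        # bottleneck features
    return z, w_reg @ z.ravel()

z, features = encode_and_regress(rng.normal(size=32))
```

Placing the regressor at the far end of the decoder would instead feed it the reconstructed waveform, which requires noiseless targets during training.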



#### Quantization-Aware Training and Validation (AB)





#### Why Bring Them to Front-End Electronics (AC)








- A case study: the electromagnetic calorimeter (ECAL) of NICA-MPD
  - 64-channel, 12-bit, 62.5 MS/s ADC
  - Waveform data readout, triggering and timing over optical fiber
  - Power consumption: 250 mW/channel; a water cooling system is needed for heat dissipation
- Front-end upgrade with an ASIC
  - Pre-amplifier, 200 MS/s ADC and NN accelerator







Reduce power and bandwidth and improve performance



#### A Brief Review of PulseDL (BC)





#### Limitations of PulseDL (BC)



- The first version of the chip, although a successful first attempt, has the following limitations:
  - A RISC CPU outside the chip (or NN accelerator) is needed to schedule transactions
  - A dynamic quantization scheme is adopted, which may add to the processing time
  - The adder tree structure leaves much room for improvement (especially the temporal adder tree)
  - Only manual configuration was available; deep learning frameworks were not yet supported
- These limitations motivated us to develop *PulseDL-II*, the new version of the chip

#### PulseDL-II: Improvement in System Structure (BC)





- Integrate a RISC CPU into the digital design to form a System-on-Chip (SoC)
  - RISC CPU: ARM Cortex-M0
  - System bus: AHB/APB
- The PulseDL-II NN accelerator is attached to the processor's AHB as a peripheral
- Input/Output peripherals:
  - Quad/Normal SPI
  - UART (with or without internal buffer)
  - JTAG
  - GPIO

#### PulseDL-II: Improvement in Accelerator Architecture (BC)



Compared to the last version:

- Adding a new topological level: Arithmetic Unit (AU)
- Broadcasting of input feature map and kernel
- Optimizing the adder tree with partial sum accumulator
- Adding function blocks for bias addition and activation
- For quantization compatible with TensorFlow or other deep learning frameworks, rescale and shift are supported

B.Jacob et al 2018 CVPR 2704 3
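The rescale-and-shift step can be illustrated in the spirit of the integer-only scheme of Jacob et al.: the real-valued multiplier M = S_x · S_w / S_y is approximated by an integer `m0` and a right shift, so the accumulator is requantized without floating-point arithmetic. The bit widths, scales, and function names below are assumptions for illustration; the talk does not give the chip's actual parameters.

```python
import numpy as np

def quantize_multiplier(real_multiplier, bits=15):
    """Approximate a real multiplier in (0, 1) as m0 * 2**-shift, m0 integer."""
    if not 0.0 < real_multiplier < 1.0:
        raise ValueError("multiplier must lie in (0, 1)")
    shift = 0
    while real_multiplier < 0.5:        # normalize into [0.5, 1)
        real_multiplier *= 2.0
        shift += 1
    m0 = int(round(real_multiplier * (1 << bits)))
    return m0, shift + bits

def requantize(acc, m0, shift, zero_point=0):
    """Rescale wide accumulators to int8 with integer multiply and shift only."""
    acc = np.asarray(acc, dtype=np.int64)
    rounded = (acc * m0 + (1 << (shift - 1))) >> shift   # round to nearest
    return np.clip(rounded + zero_point, -128, 127).astype(np.int8)

# Hypothetical scales for input, weights and output
real_m = 0.02 * 0.005 / 0.05
m0, shift = quantize_multiplier(real_m)
acc = np.array([1000, -2500, 40000])
out = requantize(acc, m0, shift)
reference = np.round(acc * real_m).astype(np.int8)
```

For these values the integer path reproduces the floating-point reference; in general the two may differ by one least significant bit because `m0` is a rounded approximation.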

#### Hardware-Software Codesign (BC)







## Embedded Software with Weight-Stationary Mapping (BC)

- The designed hardware allows different mapping rules
- For NNs with small/medium size, a weight-stationary mapping scheme can be adopted
  - Weights are stored into PEs before samples come in (Preparation Phase)
  - Only input data, output data and intermediate feature maps are transferred during inference (Inference Phase)
- The embedded software enables the following features:
  - Layer-wise inference pipelining: weights for different layers are mapped to different groups of PEs, which can operate simultaneously
  - Event-level parallelism: each event is assigned a unique token, which is passed along the pipeline together with its feature maps
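The interplay of the two phases and the token mechanism can be modelled with a toy cycle-level simulation: each layer is a PE group whose weight is loaded once, and tagged events advance one group per cycle. This is a behavioural sketch only; `make_layer`, `run_pipeline`, and the scalar "layers" are hypothetical and do not reflect the actual embedded software.

```python
from collections import deque

def make_layer(weight):
    """Preparation phase: the weight is stored in the PE group once."""
    def layer(feature_map):
        return feature_map * weight
    return layer

def run_pipeline(events, layers):
    """Inference phase: only feature maps and their tokens move between groups."""
    slots = [None] * len(layers)            # (token, feature_map) per PE group
    pending = deque(enumerate(events))      # token = event index
    results, cycles = {}, 0
    while pending or any(slot is not None for slot in slots):
        if slots[0] is None and pending:
            slots[0] = pending.popleft()
        # All PE groups work on different events in the same cycle
        outputs = [None if slot is None else (slot[0], layers[i](slot[1]))
                   for i, slot in enumerate(slots)]
        if outputs[-1] is not None:
            token, feature_map = outputs[-1]
            results[token] = feature_map    # token identifies the finished event
        slots = [None] + outputs[:-1]       # hand results to the next group
        cycles += 1
    return results, cycles

layers = [make_layer(w) for w in (2.0, 0.5, 3.0)]
results, cycles = run_pipeline([1.0, 2.0, 3.0, 4.0], layers)
```

With four events and three layers the model finishes in 6 cycles instead of the 12 needed when each event must pass through all layers before the next one starts.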



#### Evaluation of Performance, Power and Area (BC)

#### Evaluation settings

- Xilinx ZCU104 Evaluation Board
- 100 MHz working frequency
- post-synthesis

(*PulseDL-II* NN accelerator is **isolated** for fair comparison with *PulseDL*)

[Figure: comparison of *PulseDL* and *PulseDL-II*, normalized per MUL8. Panels: LUT/MUL8, FF/MUL8]

- Power (energy consumption): 1.81x less
- Area (resource utilization): comparable or less



#### System Validation: Experimental Setup (ABC)



Host Computer: 1. FPGA Firmware, Integrated Logic Analyzer; 2. ARM MCU Program; 3. Feature Output



#### System Validation: Digital Logic for Data Acquisition (ABC)



#### System Validation: Experimental Results (ABC)





- 32 sample points per event
- Dual-channel synchronous waveform input to the NNs



[Figure: time distribution (x-axis: time (ns)) and energy histograms (x-axis: energy / normalized energy)]

| Energy extraction method              | Mean     | Std.   | Resolution |
|---------------------------------------|----------|--------|------------|
| Waveform integration                  | 1255.293 | 17.006 | 1.355%     |
| Floating-point NN (normalized energy) | 0.999    | 0.002  | 0.231%     |
| Quantized NN (normalized energy)      | 1.001    | 0.004  | 0.402%     |
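The reported energy resolutions appear consistent with the ratio std./mean. This is an inference from the printed numbers, not a definition stated on the slide. The short check below recomputes the ratio from the rounded mean and std. values.

```python
# Printed (mean, std.) values from the energy results
results = {
    "Waveform integration": (1255.293, 17.006),
    "Floating-point NN": (0.999, 0.002),
    "Quantized NN": (1.001, 0.004),
}

def resolution_percent(mean, std):
    """Relative resolution, assumed here to be std / mean."""
    return 100.0 * std / mean

resolutions = {name: resolution_percent(m, s) for name, (m, s) in results.items()}
```

The waveform-integration row reproduces the quoted 1.355%. The two NN rows come out close to 0.2% and 0.4% rather than exactly 0.231% and 0.402%, since their printed std. values are rounded to three decimals.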

- Runtime statistics (Zynq UltraScale+):

| Category              | Item          | Value             |
|-----------------------|---------------|-------------------|
| Resources (area)      | LUT           | 2825 + 89540      |
|                       | FF            | 517 + 75028       |
|                       | BRAM          | 8.0 + 48.0        |
|                       | URAM          | 8 + 0             |
| Power                 | Dynamic       | (0.371 + 0.541) W |
|                       | Static        | 0.594 W           |
| Performance @ 100 MHz | Internal inf. | 113.8 µs          |
|                       | Throughput    | 8.3k events/sec   |

### Summary



- The ability and potential of NNs in signal feature extraction are investigated
- Application-specific NN architectures are designed
- NN accelerator-based front-end electronics are prototyped
- A System-on-Chip digital system with an NN accelerator is developed
- System validation on an FPGA platform is completed

What's next:

- Evaluate the whole system in real-world nuclear detector dataflows
- Design optimization, ASIC layout, tape-out with advanced technology



## THANK YOU!

#### **ANY QUESTIONS?**