### Design of a reconfigurable autoencoder neural network for detector front-end ASICs

INFIERI 2021 – August 31, 2021

- Columbia University : Giuseppe Di Guglielmo, Luca Carloni
- Fermilab : Farah Fahim, Cristian Gingu, Christian Herwig, Jim Hirschauer, Martin Kwok, Nhan Tran
- Florida Tech : Danny Noonan
- Northwestern University : Manuel Valentin, Yingyi Luo, Seda Memik











#### With thanks to the CMS Collaboration, and in particular, the CMS High–Granularity Calorimeter Group





# Thanks also to **\$\$\$ FAST MACHINE LEARNING LAB**

https://fastmachinelearning.org/

2020 Fast ML for Science workshop: https://indico.cern.ch/event/924283/

Please join the next workshop : tentatively end-of-2021 / early-2022



#### HEP data challenge

HEP aims to discover ever more massive particles, probe smaller distances, and study rarer processes.

This requires a series of colliders with continually increasing **energy** and **luminosity** 

→ increasing **detector occupancy** 

→ increasing detector granularity and precision



→ increasing data volume produced by detector

"The solution to every problem is another problem." Johann Wolfgang von Goethe

### HEP data challenge

| Collider                                       | Tevatron               | LHC                   | HL-LHC                           | FCC-hh                 |
|------------------------------------------------|------------------------|-----------------------|----------------------------------|------------------------|
| Luminosity [cm <sup>-2</sup> s <sup>-1</sup> ] | 3.7 × 10 <sup>33</sup> | 21 × 10 <sup>33</sup> | 50 × 10 <sup>33</sup> with leveling | 300 × 10 <sup>33</sup> |
| Pileup                                         | 1–2                    | 50                    | 200                              | 1000                   |
| Typical number of tracker channels             | <1M                    | >100M                 | >1B                              | 17B **                 |
| Typical number of calorimeter channels         | <100k                  | >100k                 | 6M                               | 100M ***               |
| Inner detector TID                             | 10 Mrad                | 100 Mrad              | 500 Mrad                         | 30 Grad *              |



#### Data challenge solutions $\rightarrow$ new problems

Increasing detector data volume

- → move more data processing to on-detector electronics
  - → increasing complexity, power consumption, and radiation-tolerance requirements

What data processing should move on-detector?

- data compression
- reconstruction of low-level objects (hits, clusters)
- reconstruction of high-level objects (tracks, jets)

#### On-detector data compression

- This talk: Neural Network (NN) autoencoder in ASIC for on-detector data compression.
- General requirements for on-detector electronics:
  - Low power consumption → well suited to ASIC
  - Radiation tolerant → well suited to ASIC
  - Complexity: design must be re-configurable → challenging for ASIC
- Specific requirements for the CMS High-Granularity Calorimeter (HGCAL).



Illustration: Lisa Hornung/iStockPhoto

### Context within HL-LHC Data Challenge



### CMS High Granularity Calorimeter (HGCAL)

- "Imaging calorimeter" with ~6M readout channels.
  - $60 \times$  increase from current LHC calorimeters.
- ~50 layers of active material + absorber.
  - silicon sensors in front layers
  - scintillator + silicon in back layers





### CMS High Granularity Calorimeter (HGCAL)

Each layer is tiled with 300-500 8" hexagonal modules.

Each 8" module includes either 192 or 432 sensor channels.




#### Front End electronics: each 8" module includes

- ≤6 HGCROC ASIC : digitizes charge and arrival time and provides charge data for trigger path.
- 1 ECON-T ASIC: selects/compresses digital trigger data for transmission off-detector.
  - On-detector data compression with machine learning

#### Imaging calorimeter

500 GeV jet in 140 pileup



#### Imaging calorimeter



Jingyu Zhang ICHEP 2020

### HGCAL trigger data challenge

| Trigger path stage  | Number<br>channels | bits/<br>channel | Average<br>Compression factor | Data rate* | # links*<br>(10.24 Gbps) |
|---------------------|--------------------|------------------|-------------------------------|------------|--------------------------|
| Raw data            | 6M                 | 20               | 1                             | 5 Pb/s     | 1M                       |
| Hardware reduction  | 1M                 | 7                | 1                             | 300 Tb/s   | 60k                      |
| Threshold selection | 1M                 | 7                | 7                             | 40 Tb/s    | 9k                       |

\* Assumes 40 MHz rate and 50% link packing efficiency
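The table entries follow from simple arithmetic on the channel count, bits per channel, and the assumptions in the footnote. A minimal sketch of that arithmetic is shown here; the function and variable names are illustrative and not taken from the talk.

```python
BX_RATE_HZ = 40e6          # 40 MHz rate, from the table footnote
LINK_GBPS = 10.24          # link speed, from the table header
PACKING_EFFICIENCY = 0.5   # 50% link packing efficiency, from the footnote

def data_rate_bps(channels, bits_per_channel, compression=1.0):
    """Aggregate data rate in bits per second for one trigger-path stage."""
    return channels * bits_per_channel * BX_RATE_HZ / compression

def links_needed(rate_bps):
    """Number of links (as a float) needed to carry rate_bps."""
    usable_bps = LINK_GBPS * 1e9 * PACKING_EFFICIENCY
    return rate_bps / usable_bps

# (stage, channels, bits per channel, average compression factor) from the table
stages = [
    ("Raw data", 6e6, 20, 1),
    ("Hardware reduction", 1e6, 7, 1),
    ("Threshold selection", 1e6, 7, 7),
]

for name, channels, bits, compression in stages:
    rate = data_rate_bps(channels, bits, compression)
    print(f"{name:20s} {rate / 1e12:8.1f} Tb/s {links_needed(rate):>10,.0f} links")
```

The computed rates (4.8 Pb/s, 280 Tb/s, 40 Tb/s) agree with the rounded values in the table; the computed link counts come out slightly below the table's rounded figures.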







Sensor channels → 48 trigger cells (TC) @ 7b per TC



#### Specific challenges and requirements for the on-detector ASIC

#### **Occupancy and pileup:**

- Varies by 2-3 orders of magnitude over pseudorapidity/depth and in time.
- Compression neural network must be configurable to handle different detector locations and changing detector/beam conditions

#### Latency:

- On-trigger-path latency is precious → must be < 100 ns

#### **Power:**

- ~30k encoder networks on the entire detector.
- Power budget is 100 mW per network, or around 1 nJ per inference.

#### Radiation tolerance : up to 500 Mrad



#### Autoencoder concept for data compression





#### Encoder NN design considerations

- Minimize : power + area + latency
- Maximize : physics performance + configurability + radiation tolerance
- Network architecture and precision of weights and biases: fixed in design
- Fully re-configurable : all network weights and biases + dimensionality of output



#### Encoder NN design considerations

Encoder NN components

- Convolutional layer (conv2D): extract geometric features
- Flatten layer : vectorizes 2D image from conv2D (  $128 = 8 \times 4 \times 4$ )
- **Dense layer :** decide which geometric features are important
- **ReLU** : activation function
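The data flow through these components can be sketched in plain Python. This is only an illustration of the layer sequence, under stated assumptions: an 8×8 single-channel input, 8 filters with a 3×3 kernel and stride 2 (the final design quoted in the optimization table), zero padding in the "same" convention, 16 outputs, ReLU applied after both layers, no convolution biases, and random weights. None of these implementation details are confirmed by the slides.

```python
import random

def conv2d_same(image, kernels, stride):
    """Strided 2D convolution with zero padding ('same' convention), one input channel."""
    h, w, k = len(image), len(image[0]), len(kernels[0])
    out_h, out_w = -(-h // stride), -(-w // stride)      # ceiling division
    top = max((out_h - 1) * stride + k - h, 0) // 2      # padding rows above
    left = max((out_w - 1) * stride + k - w, 0) // 2     # padding columns on the left
    maps = []
    for kern in kernels:
        fmap = [[0.0] * out_w for _ in range(out_h)]
        for i in range(out_h):
            for j in range(out_w):
                for di in range(k):
                    for dj in range(k):
                        y, x = i * stride + di - top, j * stride + dj - left
                        if 0 <= y < h and 0 <= x < w:    # outside the image counts as zero
                            fmap[i][j] += image[y][x] * kern[di][dj]
        maps.append(fmap)
    return maps

def relu(values):
    return [max(v, 0.0) for v in values]

def flatten(maps):
    """Vectorize the stack of 2D feature maps into one list."""
    return [v for fmap in maps for row in fmap for v in row]

def dense(inputs, weights, biases):
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

random.seed(0)
rand = lambda: random.uniform(-1, 1)                     # placeholder weights (assumption)
image = [[random.random() for _ in range(8)] for _ in range(8)]
kernels = [[[rand() for _ in range(3)] for _ in range(3)] for _ in range(8)]

features = relu(flatten(conv2d_same(image, kernels, stride=2)))
W = [[rand() for _ in range(len(features))] for _ in range(16)]
b = [rand() for _ in range(16)]
encoded = relu(dense(features, W, b))

print(len(features), len(encoded))  # 128 16
```

With these assumptions the flattened vector has 8 × 4 × 4 = 128 entries, matching the number quoted for the flatten layer.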

**Encoder NN** 



### Encoder NN architecture optimization

• Optimize encoder network architecture choices including :



#### Performance metric : EMD

Model training

- Uses calibrated TC inputs.

Energy Mover's Distance (EMD)

- Quantifies encoding performance by comparing the input image with the decoded image.
- The EMD is the "work" required to transform the decoded image into the input image.
- Definition from "The Metric Space of Collider Events" (Patrick T. Komiske, Eric M. Metodiev, Jesse Thaler; Center for Theoretical Physics, Massachusetts Institute of Technology, and Department of Physics, Harvard University), which advocates the earth (or energy) mover's distance as a metric for the space of collider events:

$$
\text{EMD}(\mathcal{E}, \mathcal{E}') = \min_{\{f_{ij}\}} \sum_{ij} f_{ij} \frac{\theta_{ij}}{R} + \left| \sum_i E_i - \sum_j E'_j \right|, \quad (1)
$$

$$
f_{ij} \ge 0, \quad \sum_j f_{ij} \le E_i, \quad \sum_i f_{ij} \le E'_j, \quad \sum_{ij} f_{ij} = E_{\min}.
$$

Here $f_{ij}$ is the energy moved from particle $i$ in $\mathcal{E}$ to particle $j$ in $\mathcal{E}'$, $\theta_{ij}$ is their angular distance, $R$ sets the relative weight of the two terms, and $E_{\min}$ is the smaller of the two total energies.

C. Herwig – ECON, Jun 3, 2020
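To make the "work" interpretation of the EMD concrete, here is a deliberately simplified sketch: a one-dimensional special case with unit bin spacing and equal total energy in both profiles, where the minimum transport cost reduces to a sum over running-sum differences. It is not the 2D image computation used in the study; the function name and the toy profiles are illustrative.

```python
from itertools import accumulate

def emd_1d(a, b):
    """EMD between two 1D energy profiles with equal totals and unit bin spacing."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of bins")
    if abs(sum(a) - sum(b)) > 1e-9:
        raise ValueError("this sketch assumes equal total energy")
    # The energy that has to cross each bin boundary is the difference
    # of the running sums of the two profiles up to that boundary.
    crossing = accumulate(x - y for x, y in zip(a, b))
    return sum(abs(c) for c in crossing)

original = [0, 4, 0, 0]
shifted = [0, 0, 4, 0]   # all energy displaced by one bin
smeared = [0, 2, 2, 0]   # half of the energy displaced by one bin

print(emd_1d(original, shifted))  # 4
print(emd_1d(original, smeared))  # 2
```

A decoded image that misplaces more energy, or moves it farther, gets a larger EMD, which is why a lower EMD indicates a better encoding.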

### Physics driven hardware co-design

Rapid prototyping and optimization of network achieved through

- **QKeras** : network development with **quantization-aware training** and physics simulation
- hls4ml : neural network description (e.g. an h5 file)  $\rightarrow$  HLS-compliant C++
- Catapult HLS : C++ → RTL
- TMR4sv\_hls : automated triple modular redundancy (TMR) for SystemVerilog



### Rapid design optimization

- Performance : EMD mean and RMS are both important
- Power and area : scale with number of model operations and parameters

#### Lower EMD is better

| Test feature        | Geometry | # filters | Kernel | Stride | Pooling | Relative # params | Relative # operations | Relative EMD mean | Relative EMD RMS |
|---------------------|----------|-----------|--------|--------|---------|-------------------|-----------------------|-------------------|------------------|
| Reference           | 4x4x3    | 8         | 3x3    | 1      | none    | 1.00              | 1.00                  | 1.00              | 1.00             |
| 4x4x3 -> 8x8        | 8x8      | 8         | 3x3    | 1      | none    | 2.73              | 1.76*                 | 0.64              | 0.41             |
| max pooling         | 8x8      | 8         | 3x3    | 1      | 2x2     | 0.71              | 0.97*                 | 0.59              | 0.33             |
| 3x3 -> 5x5 kernel   | 8x8      | 8         | 5x5    | 1      | 2x2     | 0.99              | 2.76                  | 0.64              | 0.35             |
| pooling -> stride=2 | 8x8      | 8         | 3x3    | 2      | none    | 0.94              | 0.59                  | 0.76              | 0.46             |
| 8 -> 10 filters     | 8x8      | 10        | 3x3    | 2      | none    | 1.17              | 0.73                  | 0.73              | 0.43             |
| 8 -> 6 filters      | 8x8      | 6         | 3x3    | 2      | none    | 0.70              | 0.44                  | 0.85              | 0.57             |

\* zero operations removed

• **Reference design** : presented in Fall 2020\*\*

- Final design :  $8 \times 8$  geometry + 8 filters +  $3 \times 3$  kernel + stride = 2
  - 50% power and 80% area of reference (from simulation)
  - 2× better performance (EMD RMS) than reference

\*\* Reference design presentations:

- https://indico.cern.ch/event/924283/contributions/4105329/attachments/2152250/3630590/encoder\_asic\_fastml2020.pdf
- https://www.eventclass.org/contxt\_ieee2020/online-program/session?s=N-34#e280
- https://www.eventclass.org/contxt\_ieee2020/online-program/session?s=N-24#e189

#### **Optimization of NN output**

- Better to use many low-precision or fewer high-precision outputs?
- Compare EMD performance keeping power and area fixed.
- Conclusion : many lower-precision outputs are better than fewer high-precision outputs
  - for both high- and low-bandwidth scenarios
  - for full range of module occupancy

ECON ASIC allows user to **select any** of 16×9 output bits for transmission

- Expect to use 16 × 3 (9) bits for low (high) occupancy zones.
- Corresponding precision used in QKeras quantization-aware training optimizes network for programmed output configuration.
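The two output configurations imply different compression factors. The sketch below assumes the encoder input for one module is the 48 trigger cells at 7 bits each quoted earlier in the talk; the precision actually fed to the encoder may differ, and the names are illustrative.

```python
N_TRIGGER_CELLS = 48   # trigger cells per module (from the trigger path slide)
BITS_PER_TC = 7        # bits per trigger cell (assumed to be the encoder input size)
N_OUTPUTS = 16         # encoder outputs available for transmission

def compression_factor(bits_per_output):
    """Ratio of input bits to transmitted bits for one module and one bunch crossing."""
    input_bits = N_TRIGGER_CELLS * BITS_PER_TC
    output_bits = N_OUTPUTS * bits_per_output
    return input_bits / output_bits

print(compression_factor(3))            # 7.0, low-occupancy setting (16 x 3 bits)
print(round(compression_factor(9), 2))  # 2.33, high-occupancy setting (16 x 9 bits)
```

Under this assumption the low-occupancy setting transmits 48 bits in place of 336.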



### Single event effect mitigation

#### Data path : Encoder & Converter





- New data every 25ns
- Triplicate registers without auto-correction



- Long term weights storage
- Triplicate registers, logic, and clocks
- Auto-correction included
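The difference between triplication with and without auto-correction can be illustrated with a small behavioural model. This is a sketch only, not the TMR4sv\_hls implementation: the class and method names are hypothetical, and a single-event upset is emulated by flipping one bit in one copy.

```python
def majority(a, b, c):
    """Bitwise majority vote of three register copies."""
    return (a & b) | (a & c) | (b & c)

class TmrRegister:
    """Three copies of one register value, read through a majority voter."""

    def __init__(self, value, auto_correct):
        self.copies = [value] * 3
        self.auto_correct = auto_correct

    def upset(self, copy, bit):
        """Emulate a single-event upset: flip one bit in one copy."""
        self.copies[copy] ^= 1 << bit

    def read(self):
        voted = majority(*self.copies)
        if self.auto_correct:
            self.copies = [voted] * 3   # write the voted value back to all copies
        return voted

WEIGHT = 0b101101

# Long-term storage with auto-correction: upsets are scrubbed before they accumulate.
stored = TmrRegister(WEIGHT, auto_correct=True)
stored.upset(copy=0, bit=2)
assert stored.read() == WEIGHT
stored.upset(copy=1, bit=2)
assert stored.read() == WEIGHT

# Without auto-correction, a second upset on the same bit defeats the voter.
unscrubbed = TmrRegister(WEIGHT, auto_correct=False)
unscrubbed.upset(copy=0, bit=2)
assert unscrubbed.read() == WEIGHT
unscrubbed.upset(copy=1, bit=2)
print(bin(unscrubbed.read()))  # 0b101001
```

One plausible reading of the slide is that the data path can skip auto-correction because its registers are overwritten with new data every 25 ns, so upsets have no time to accumulate, whereas stored weights persist and need scrubbing.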

### Design and verification methodology

Verification performed at each stage of design:

- Model training
- hls4ml
- Catapult HLS
- RTL
- Synthesis
- Place and route
- LVS and DRC



## Design and verification methodology

| Step                      | Type | Run time | Iterations | Size                 | Optimization phase   |
|---------------------------|------|----------|------------|----------------------|----------------------|
| Model generation          | D    | 1 s      | 50-100     | 1.1k lines of C++    | Network optimization |
| C simulation              | V    | 1 s      | 50-100     | 1.1k lines of C++    | Network optimization |
| HLS                       | D    | 30 min   | 3-100      | 40k lines of Verilog | Design optimization  |
| RTL simulation            | V    | 1 min    | 5-100      | 40k lines of Verilog | Design optimization  |
| Logic synthesis           | D    | 6 hr     |            | 750k gates           |                      |
| Gate-level sim            | V    | 30 min   |            | 750k gates           |                      |
| Place and route           | D    | 50 hr    | 6          |                      |                      |
| Post-layout sim           | V    | 1 hr     | 0          | 780k gates           |                      |
| Post-layout parasitic sim | V    | 2 hr     |            | 780k gates           |                      |
| SEE simulation            | V    | 4 hr     |            |                      |                      |
| Layout                    | D    | 20 min   | 1          | 7.6M transistors     |                      |
| LVS and DRC               | V    | 1 hr     |            | 7.6M transistors     |                      |

D = design step, V = verification step. Run time and complexity increase down the table.

#### Place and route

• Integrated design to avoid routing congestion from 14k bits of weights (programmable via I<sup>2</sup>C) connected from periphery.



#### Distributed I<sup>2</sup>C weights

### **Design Performance Metrics**

Physics performance studies are in progress → preliminary performance with non-optimized training is comparable to the traditional threshold algorithm.

| Requirement             | Value                                                 |
|-------------------------|-------------------------------------------------------|
| Rate                    | 40 MHz                                                |
| Total ionizing dose     | 200 Mrad                                              |
| High energy hadron flux | $1 \times 10^7\ \text{cm}^{-2}\,\text{s}^{-1}$        |

| Metric             | Simulation           | Target             |
|--------------------|----------------------|--------------------|
| Power              | 48 mW                | <100 mW            |
| Energy / inference | 1.2 nJ               | N/A                |
| Area               | 2.88 mm <sup>2</sup> | <4 mm <sup>2</sup> |
| Gates              | 780k                 | N/A                |
| Latency            | 50 ns                | <100 ns            |
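The energy per inference in the table follows from the simulated power and the 40 MHz rate, assuming one inference per 25 ns cycle. A minimal sketch of the conversion (the function name is illustrative):

```python
def energy_per_inference_nj(power_mw, rate_mhz):
    """Energy per inference in nJ, assuming one inference per clock cycle."""
    power_w = power_mw * 1e-3
    rate_hz = rate_mhz * 1e6
    return power_w / rate_hz * 1e9

print(f"{energy_per_inference_nj(48, 40):.1f} nJ")   # 1.2 nJ at the simulated 48 mW
print(f"{energy_per_inference_nj(100, 40):.1f} nJ")  # 2.5 nJ at the 100 mW target limit
```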

### ECON-T-P1 submitted

- ECON-T-P1 submitted for fabrication on June 28, 2021.
- Chips expected to reach Fermilab in early October 2021.
- We are ready and excited to test the chip and evaluate the performance of the NN encoder.



#### Summary

- Autoencoder neural network for on-detector data compression.
  - Low power, low latency, radiation tolerant, fully re-configurable
  - 65nm LP CMOS
  - Prototypes will be tested in Fall 2021
- Established design and verification methodology based on hls4ml + Catapult HLS allows rapid progression from algorithm development through circuit implementation.
- Optimized network provides 2× better performance at ~50% power of reference network.

#### Acknowledgements

- ECON design team for inclusion in ECON ASIC : Davide Braga, Mike Hammer, Jim Hoff, Paul Rubinov, Alpana Shenai, Cristina Mantilla Suarez, Chinar Syal, Xiaoran Wang, Ralph Wickwire
- CMS HGCAL for simulated training images
  - Jean-Baptiste Sauvan for simulation development
  - Andre Davide for useful discussion on network optimization
- hls4ml developers : Javier Duarte, Phil Harris, Vladimir Loncar, Jennifer Ngadiuba, Maurizio Pierini, Sioni Summers https://fastmachinelearning.org/hls4ml/
- Mentor/Siemens Catapult HLS : Sandeep Garg and Anoop Saha
- Cadence Innovus and Incisive : Bruce Cauble and Brent Carlson

### Additional material

### Precision of weights and variables

- Diagram is example for 4×4×3 reference network – same structure as final 8×8 network
- Weights are all 6b

For final 8×8 network:

- hidden layer neurons:
  - 8b fraction
  - sufficient integer bits to cover theoretical max value
- output neurons:
  - 9b total
  - 1b integer
  - covers maximum value from physics simulation
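The output format above (9 bits total, 1 integer bit) can be illustrated with a small fixed-point quantizer. This is a sketch under assumptions that the slides do not state: the outputs are treated as unsigned because they follow a ReLU, values are rounded to the nearest step, and out-of-range values saturate. The function name is illustrative.

```python
def quantize_unsigned(value, total_bits, integer_bits):
    """Round value to an unsigned fixed-point grid and saturate at the range limits."""
    frac_bits = total_bits - integer_bits
    scale = 1 << frac_bits                 # number of steps per unit
    max_code = (1 << total_bits) - 1       # largest representable integer code
    code = min(max(round(value * scale), 0), max_code)
    return code / scale

# 9 bits total with 1 integer bit leaves 8 fractional bits: step 1/256, max 511/256
print(quantize_unsigned(0.3, 9, 1))    # 0.30078125
print(quantize_unsigned(5.0, 9, 1))    # 1.99609375 (saturated)
print(quantize_unsigned(-0.2, 9, 1))   # 0.0 (clipped at zero)
```

With this format the largest representable output is just under 2, which is consistent with a single integer bit covering the maximum value seen in physics simulation.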

