# Design and Testing of a reconfigurable AI-ASIC for front-end data compression at the HL-LHC

**Fermilab**: F. Fahim, C. Gingu, C. Herwig, J. Hirschauer, C. Mantilla Suarez, D. Noonan, P. Rubinov, N. Tran **Baylor**: J. Wilson

> 2022 IEEE Real Time 5th August 2022

### With thanks to the CMS collaboration, the CMS High-Granularity Calorimetry group, hls4ml, and the lpGBT designers (ePortRX + PLL).

<u>https://lpgbt-support.web.cern.ch/</u> <u>https://fastmachinelearning.org/hls4ml/#</u> <u>https://cms.cern/</u>









## CMS High Granularity Calorimeter HGCAL

- A high granularity detector to deal with high occupancy.
- Harsh radiation environment: full volume operated at -30 °C.
- ~50 layers of active material (Si, scintillator) + absorber:
  - Each front layer is tiled with 300-500 8" hexagonal silicon modules.
- Spatial granularity: 6M channels in ~40 m<sup>3</sup>.

#### Absorbed Dose at 3000 fb<sup>-1</sup>







ECON-T is an on-detector data concentrator ASIC for the trigger path. It aggregates, selects and compresses charge data @ 40 MHz. It runs an encoder neural network as one of the data compression algorithms.





## The HGCAL trigger data challenge

#### HGCROCv3:

#### **Raw-data**

Sends sum of 4 (9) channels (7-bit floating point format) @ 1.28 Gbps.

- 40 Tb/s, 1M channels
- 5 Pb/s, 6M channels
- 300 Tb/s, 1M channels
- ~2 x 1.28 Gbps links per module
- ~9k 10.24 Gbps links in total

#### **ECON:**

Traditional algorithm selects trigger charge data.







### The ECON ASIC overview



- Latency: 400 ns = 16 clock cycles
  - Encoder NN: 50 ns
- Radiation tolerance: 200 Mrad, 1 × 10<sup>16</sup> Neq/cm<sup>2</sup>
  - Using 65nm CMOS with standard cells characterized for radiation performance.
- Low power: ≤5 mW/channel
- 1.28 Gbps links: 12 inputs and 13 outputs (most of the modules use only 2 outputs)
- Packaging: 128-pin Low Profile Quad Flat Pack
  - 200 ASICs have been packaged from 300 produced parts in P1.
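
The latency and link figures above can be cross-checked against the 40 MHz operating rate quoted earlier. The following is a minimal sketch of that arithmetic; the variable names are illustrative and not taken from the ECON-T design.

```python
# Cross-check of the timing figures quoted on the slide.
# All names are illustrative; the input numbers come from the slide text.
CLOCK_HZ = 40e6     # ECON-T processes data at 40 MHz
LINK_BPS = 1.28e9   # speed of each input/output serial link

clock_period_ns = 1e9 / CLOCK_HZ                # duration of one clock cycle
total_latency_ns = 16 * clock_period_ns         # 16 clock cycles of latency
encoder_cycles = 50 / clock_period_ns           # cycles taken by the 50 ns encoder NN
bits_per_link_per_cycle = LINK_BPS / CLOCK_HZ   # payload of one link per clock cycle

print(f"clock period: {clock_period_ns:.0f} ns")
print(f"total latency: {total_latency_ns:.0f} ns")
print(f"encoder NN: {encoder_cycles:.0f} clock cycles")
print(f"bits per link per clock cycle: {bits_per_link_per_cycle:.0f}")
```

Under these numbers a module that uses 2 outputs can send 64 bits per clock cycle.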







ASIC design (including AI on chip)



### ASIC blocks

- 12 input receivers
- Word aligner
- Multiplexer and calibration
- Algorithms
- Formatter and buffer
- 13 output transmitters





## The data compression algorithms



Fixed latency:

| Best Choice | Super Trigger Cell | Encoder Neural Network |
|-------------|--------------------|------------------------|
| Sorts TC by charge, sends N with largest Q | Groups TC and forms larger super TCs | Encodes TC with fully reconfigurable weights |





AI on chip

### Encoder in ASIC

- Original input: 48 x 7-bit input, **336 bits**.
- Encoder: on-detector ASIC.
- Compressed representation: transmit 16 x 3-bit outputs\*, **48 bits**.
- Decoder: off-detector FPGA.
- Reconstructed output: decoded 48-pixel image.

\*for low occupancy zones
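
The bit counts quoted for the encoder (48 x 7-bit input, 16 x 3-bit output) imply a fixed compression ratio. A short sketch of that arithmetic, with a hypothetical helper name:

```python
def payload_bits(n_values: int, bits_per_value: int) -> int:
    """Total number of bits needed to carry n_values words of equal width."""
    return n_values * bits_per_value


# Numbers quoted on the slide for low occupancy zones.
input_bits = payload_bits(48, 7)    # encoder input
output_bits = payload_bits(16, 3)   # compressed representation
compression_ratio = input_bits / output_bits

print(input_bits, output_bits, compression_ratio)
```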
















## Architecture optimization





## How to optimize

Use the energy mover's distance (EMD) as the "work" required to rearrange one image into another.
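
To illustrate the idea of "work", here is a minimal sketch of the energy mover's distance in one dimension. This is a simplification chosen for brevity: the slide refers to detector images, whereas this example assumes two histograms on the same unit-spaced bins, for which the distance reduces to the summed absolute difference of the cumulative distributions. The function name is illustrative.

```python
from itertools import accumulate


def emd_1d(p, q):
    """Energy mover's distance between two 1D histograms on identical,
    unit-spaced bins. Both histograms are normalized to unit total first."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total content")
    cdf_p = accumulate(x / total_p for x in p)
    cdf_q = accumulate(x / total_q for x in q)
    # Each bin boundary is crossed by |CDF_p - CDF_q| units of content.
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# Moving all content by one bin costs half as much as moving it by two bins.
print(emd_1d([1, 0, 0], [0, 1, 0]))
print(emd_1d([1, 0, 0], [0, 0, 1]))
```

A metric of this kind penalizes an autoencoder more when reconstructed energy lands far from where it was deposited than when it lands in a neighboring cell.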



Input image





| Type | Run Time | Iterations | Size              |
|------|----------|------------|-------------------|
| D    | 1s       | 50-100     | 1.1k lines of C++ |
| V    | 1s       | 30-100     |                   |
| D    | 30 min   | 2 100      | 40k lines of Verilog |
| V    | 1 min    | 3-100      |                   |
| D    | 6 hrs    |            | 750k gates        |
| V    | 30 min   |            | 750k gates        |
| D    | 50 hrs   | 6          |                   |
| V    | 1 hrs    | 0          | 780k gates        |
| V    | 2 hrs    |            | 780k gates        |
| V    | 4 hrs    |            |                   |
| D    | 20 min   | 1          | 7.6M transistors  |
| V    | 1 hr     | T          |                   |



|        | Metric                             |
|--------|------------------------------------|
| …ce    | EMD                                |
| …ption | Number of registers and operations |





#### One example of what we learned during the optimization:

Many low-precision outputs are better than a few high-precision outputs.





### Physics-driven hardware co-design

Independent verification of Encoder NN



QKeras/TF Training based on LHC simulation



Figure labels: convolutional layer, dense layers, geometry.



## ECON ASIC place and route



| Metric             | Simulation          | Target              |
|--------------------|---------------------|---------------------|
| Power              | 48 mW               | <100 mW             |
| Energy / inference | 1.2 nJ              | N/A                 |
| Area               | 2.88 mm<sup>2</sup> | <4 mm<sup>2</sup>   |
| Gates              | 780k                | N/A                 |
| Latency            | 50 ns               | <100 ns             |

Encoder NN block (distributed i2c)
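
The simulated energy per inference is consistent with dividing the simulated power by the 40 MHz rate quoted for ECON-T, that is, one inference per clock cycle. That interpretation is an assumption made for this sketch; the slide does not state how the 1.2 nJ figure was derived.

```python
# Assumption: one inference is launched every 40 MHz clock cycle.
POWER_W = 48e-3            # simulated power of the encoder NN block
INFERENCE_RATE_HZ = 40e6   # assumed inference rate

energy_per_inference_nj = POWER_W / INFERENCE_RATE_HZ * 1e9

# Headroom of the simulated values relative to the targets in the table.
power_margin = 1 - 48 / 100    # 48 mW against a <100 mW target
area_margin = 1 - 2.88 / 4     # 2.88 mm^2 against a <4 mm^2 target

print(f"energy per inference: {energy_per_inference_nj:.2f} nJ")
print(f"power headroom: {power_margin:.0%}, area headroom: {area_margin:.0%}")
```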





### ASIC Testing



## Testing setup



FPGA



ECON-T test\_pack\_V1 P. Rubinov / N. Hoibenko 3/2022


- ASIC powered @ 1.2 V
- Socket testing
- Individual power domains





#### FPGA provides Fast Command (FC) clock, i<sup>2</sup>c, input and simulated



Simulation/emulation is key: a live comparison of the output data stream captures data mismatches.
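
A comparison of this kind can be sketched as a word-by-word check of the ASIC output against the emulator output. The function below is hypothetical and only illustrates the idea; the actual test-bench firmware and software are not described on the slide.

```python
def compare_streams(asic_words, emulator_words):
    """Return a list of (index, observed, expected, differing_bits) tuples,
    one for every word where the ASIC output differs from the emulator.
    Only the overlapping part of the two streams is compared."""
    mismatches = []
    for index, (observed, expected) in enumerate(zip(asic_words, emulator_words)):
        if observed != expected:
            # XOR marks exactly the bit positions that disagree.
            mismatches.append((index, observed, expected, observed ^ expected))
    return mismatches


# Hypothetical example: the second word has one flipped bit.
captured = [0xA, 0xB, 0xC]
emulated = [0xA, 0xF, 0xC]
for index, observed, expected, diff in compare_streams(captured, emulated):
    print(f"word {index}: got {observed:#x}, expected {expected:#x}, diff {diff:#x}")
```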





## Functionality has been fully verified

- Tested under different configurations (number of eTx, algorithms, ...) and input test vectors (random data, LHC simulation, ...).
- 1.28 Gbps outputs agree perfectly with simulation/emulation in test bench.
- Power-up-state-machine, PLL, eTx, Formatter, buffer: everything works.
- Total power consumption with Encoder NN below 450 mW (cf. 500 mW target).



#### 1.28 Gbps eTx eye diagram





Radiation Tolerance Testing: tests for Single Event Effects (SEE) and Total Ionizing Dose (TID)

## Reminder of Single Event Effects (SEE)

- A single particle can induce localized, non-cumulative radiation effects: bit flips, clock/logic transients or permanent damage.
- To protect against bit flips and transients: use triplication (TMR) and Hamming-based error correction codes (ECC).
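
The two protection techniques named above can be illustrated in software. The sketch below shows a bitwise majority voter, as used in triple modular redundancy, and a textbook Hamming(7,4) code that corrects any single bit flip. The choice of Hamming(7,4) is an assumption for illustration; the slide does not specify which Hamming code or word width ECON-T uses.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three copies of the same register."""
    return (a & b) | (a & c) | (b & c)


def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.
    Codeword positions 1..7 hold p1, p2, d1, p3, d2, d3, d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Correct up to one flipped bit. Returns (data_bits, error_position),
    where error_position is 0 if no error was detected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # syndrome is the 1-based error position
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], position


# One corrupted copy out of three is outvoted.
print(bin(tmr_vote(0b1010, 0b1010, 0b0010)))

# A single flipped bit in the codeword is located and repaired.
word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1
print(hamming74_correct(word))
```

Triplication masks an upset immediately but costs three times the storage, whereas an error correction code adds only a few parity bits per word.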

\*Credit to Elena Vernazz











## ECON-T P1 SEE protection

|                                   | Total      | TMR                                   | AutoCorrect | ECC |
|-----------------------------------|------------|---------------------------------------|-------------|-----|
| Data                              |            | On flip flops                         | No          | No  |
| i2c registers                     | 675 bytes  | On flip flops                         | Yes         | Yes |
| Encoder i2c weights<br>and biases | 1608 bytes | TMR on flip flops,<br>logic and clock | Yes         | No  |



## SEE Radiation testing

- ECON-T P1 ASIC tested at two different irradiation facilities:
  - At FNAL with a 400 MeV proton beam. Flux for hadrons with E > 20 MeV: 2E+15 cm<sup>-2</sup>/s
  - At a medical facility with a 217 MeV proton beam. Flux for hadrons: 5E+9 cm<sup>-2</sup>/s
  - HL-LHC flux: 3E+6 cm<sup>-2</sup>/s
- Validate overall chip performance (input alignment, PLL, serializer, data logic) by monitoring the 1.28 Gbps outputs.
- Also, check i2c registers to confirm stability.
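
The flux values above indicate how strongly a beam test accelerates exposure relative to HL-LHC operation. A minimal sketch of that ratio, using the medical facility and HL-LHC numbers from the list; the helper name is illustrative.

```python
def acceleration_factor(test_flux: float, reference_flux: float) -> float:
    """Ratio of beam-test flux to reference flux (same units for both)."""
    if reference_flux <= 0:
        raise ValueError("reference flux must be positive")
    return test_flux / reference_flux


HL_LHC_FLUX = 3e6    # cm^-2/s, as quoted on the slide
MEDICAL_FLUX = 5e9   # cm^-2/s, 217 MeV proton beam

medical_factor = acceleration_factor(MEDICAL_FLUX, HL_LHC_FLUX)
print(f"one second in the beam is roughly {medical_factor:.0f} s of HL-LHC flux")
```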





## Radiation testing results

| Facility            | Fluence<br>(p/cm<sup>2</sup>) | Preliminary                         |
|---------------------|-------------------------------|-------------------------------------|
| ITA                 | 9.6E+12                       | Bit flips on En…<br>Increase in EC… |
| Medical<br>facility | 5.4E+12                       | Bit flips on En…                    |

HL-LHC Fluence: 1E+14

- Overall excellent performance (including Encoder).
- Low / acceptable cross section for bit errors on not-fully triplicated data path.
- Low cross section for serializer errors (serializer design is already improved in v2).
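
For reference, a cross section is conventionally estimated as the number of observed errors divided by the delivered fluence, and can then be scaled to the HL-LHC fluence. The sketch below uses a hypothetical error count purely for illustration; it is not a measured ECON-T result, and the HL-LHC fluence is assumed to be in the same units as the table.

```python
def cross_section_cm2(n_errors: int, fluence_per_cm2: float) -> float:
    """Error cross section: observed errors per unit fluence."""
    if fluence_per_cm2 <= 0:
        raise ValueError("fluence must be positive")
    return n_errors / fluence_per_cm2


def expected_errors(cross_section: float, fluence_per_cm2: float) -> float:
    """Number of errors expected for a given fluence."""
    return cross_section * fluence_per_cm2


ITA_FLUENCE = 9.6e12       # p/cm^2, from the results table
HL_LHC_FLUENCE = 1e14      # quoted on the slide; units assumed to match the table
HYPOTHETICAL_ERRORS = 10   # illustrative count, not a measurement

sigma = cross_section_cm2(HYPOTHETICAL_ERRORS, ITA_FLUENCE)
projected = expected_errors(sigma, HL_LHC_FLUENCE)
print(f"cross section: {sigma:.2e} cm^2, projected errors: {projected:.0f}")
```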

#### Observations in i2c (without extracting bit cross section)

- Encoder RW registers: 4
- ECC error counters but no bit flips: 10
- Encoder RW registers: 10



### Conclusions

- ECON-T ASIC design is complete and includes an Encoder neural network for on-detector data compression.
  - Low power, low latency, radiation tolerant and fully reconfigurable weights.
- Preliminary ECON-T-P1 testing successful:
  - Overall chip and Encoder NN perform very well: algorithm and other block functionality has been fully verified.
    - Minor RTL bugs will be fixed in production.
  - Larger scale testing of 300 parts in progress.



