

### The Level 1 Scouting system of the CMS experiment

T. James (CERN), on behalf of the CMS L1 Scouting team

21st International Workshop on Advanced Computing and Analysis Techniques in Physics Research

27th Oct 2022

## CERN Openlab





### Two stage trigger system

|             | Phase 1 | Phase 2 (High Lumi)   |
|-------------|---------|-----------------------|
| Peak pileup | 60      | 200                   |
| BX rate     | 40 MHz  | 40 MHz                |
| L1 rate     | 100 kHz | 750 kHz               |
| L1 latency  | < 4 µs  | < 12 µs               |
| HLT rate    | 2 kHz   | 7.5 kHz               |
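The rates in the table imply how strongly each trigger stage reduces the data, which is the motivation for looking at what the L1 accept discards. A minimal sketch of that arithmetic follows; the dictionary layout and the function name `rejection_factors` are illustrative choices, not part of the CMS software.

```python
# Rates in Hz, copied from the table above.
RATES = {
    "Phase 1": {"bx": 40e6, "l1": 100e3, "hlt": 2e3},
    "Phase 2": {"bx": 40e6, "l1": 750e3, "hlt": 7.5e3},
}


def rejection_factors(rates):
    """Return (L1 rejection, HLT rejection, overall rejection) for one phase."""
    l1 = rates["bx"] / rates["l1"]      # bunch crossings per L1 accept
    hlt = rates["l1"] / rates["hlt"]    # L1 accepts per HLT accept
    return l1, hlt, l1 * hlt


for phase, rates in RATES.items():
    l1, hlt, total = rejection_factors(rates)
    print(f"{phase}: 1 in {l1:.0f} BX passes L1, "
          f"1 in {hlt:.0f} L1 accepts passes HLT, "
          f"1 in {total:.0f} BX overall")
```

With the Phase 1 numbers, only one bunch crossing in 400 survives the L1 accept; scouting at 40 MHz looks at the remaining ones using trigger-level information only.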





### 40 MHz Scouting: What does L1 accept miss?

- Can we acquire L1 trigger data at the full bunch crossing rate?
  - » Subset of detector information, limited resolution
- Allows for analysis of certain topologies at full rate
  - » Semi real-time analysis and/or storing of a tiny event record
- Demonstrated for the first time at the end of 2018

### **Physics cases**

- » Heavy stable charged particles over multiple BX
- » Channels where the available cuts give low efficiency at the attributed rate budget
- » Any long-lived leptonic decays, e.g. soft displaced muons

### **Diagnostic and monitoring capabilities**

- » BX-to-BX correlations always available
- » Independent per-bunch lumi measurement







[Diagram: L1 Scouting Demonstrator, Run 3. Inputs to the 40 MHz scouting system; entries cut off in the source are marked with …]

| Trigger source              | Objects              | Links        |
|-----------------------------|----------------------|--------------|
| Global Muon Trigger         | …                    | x 8, 10 Gb/s |
| Calorimeter Trigger Layer 2 | e/γ, jets, missi…    | x 7 (+1), …  |
| Barrel Muon Track Finder    | BMTF / (muon…        | x 24, 10…    |
| Global Trigger              | Global t…, algo…     | x 18, 10…    |



### Hardware: rule of three

### Xilinx KCU1500

- PCIe Gen3x8 x2
- KU115
- 2x QSFP

### Xilinx VCU128

- PCIe Gen3x16 or PCIe Gen4x8
- VU37P (w/ 8GB HBM)
- (4 + 6 w/ mezzanine) QSFP

### **Micron SB-852**

- PCIe Gen3x16
- VU9P
- 2x QSFP
- 64 GB DDR4



### CMS 40 MHz Scouting with Xilinx KCU1500





## Why ML for scouting?

- Trigger objects are calibrated for a given efficiency at a threshold
  - » For triggering, not physics analysis
- Use the offline objects as the target to re-calibrate the parameters of the trigger-level objects
- We have full offline reco & trigger objects for Zero Bias and **Triggered events**
  - » **Inputs:** L1 objects, e.g. $\mu$GMT muons
  - » **Target:** offline fully reconstructed objects
- Use classical *fully connected* neural networks to 'recalibrate' L1 information and improve its utility for an online analysis







## $\mu GMT$ re-calibration with Neural Network

- NN shown to universally improve precision of $\phi$, $\eta$ and p<sub>T</sub>, able to achieve a ~2x improvement in track parameter precision for some interesting areas of phase-space
- Trained with the Zero-bias dataset from 2017 and 2018, re-run with the Run 3 trigger emulation for up-to-date muon trigger algorithms
- $\Delta \eta$, $\Delta \varphi$, $\Delta p_T$ are the differences between the predicted (or $\mu GMT$ extrapolated) values and the offline muon tracks for matched muons ($\Delta R < 0.1$ at the 2nd muon station)





## **Micron Deep Learning Accelerator (MDLA)**

- Offers ~Tera MAC/s (multiply-accumulate operations per second)
- Proprietary Inference Engine firmware: a scalable and programmable solution for deep learning inference













- Micron SB-852 for optical input, then DMA to the PC
- MDLA is embedded within the infrastructure & L1 scouting firmware


### **MDLA precision**

- Three ways of running:
  - » Full software, e.g. TensorFlow, ONNX Runtime
  - » In hardware on the SB-852
  - » Micron-provided software emulator (100% accurate!)
- To improve precision:
  - » "Scaling": integer inputs / 256
  - » Batch normalisation
- Q8.8 & Variable Fixed Point (VFP) modes available
- Target precision is to be below the L1 object LSB step size of the same variable, e.g. < 0.5 GeV for p<sub>T</sub>
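Why dividing integer inputs by 256 helps can be seen with a small fixed-point model. The sketch below assumes that Q8.8 means a signed 16-bit value with 8 fractional bits (sign bit counted in the integer part) and that out-of-range values saturate; neither detail is stated on the slide, and `to_fixed` is an illustrative helper, not part of the Micron tooling.

```python
def to_fixed(value, int_bits=8, frac_bits=8):
    """Round to the nearest representable signed fixed-point value.

    Assumed format: int_bits (including sign) + frac_bits, saturating at the
    limits. Ties are rounded to even by Python's built-in round().
    """
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    raw = max(lowest, min(highest, round(value * scale)))
    return raw / scale


hw_value = 300  # hypothetical integer input in hardware units

# Fed directly, the value lies outside the assumed Q8.8 range and saturates.
print(to_fixed(hw_value))        # 127.99609375

# Scaled by 1/256, the same integer is exactly representable.
print(to_fixed(hw_value / 256))  # 1.171875

# For in-range values the rounding error is at most half a step (1/512).
print(abs(to_fixed(0.1) - 0.1))
```

Under these assumptions the scaling costs nothing in precision for integer inputs, since every integer divided by 256 lands exactly on a Q8.8 step.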

[Table: precision, |hardware − TensorFlow software|, and fraction of values with < 1% difference, for the model w/ integer inputs]






## **SB-852 resource utilisation & throughput**

### **VU9P - MDLA w/ VFP**

| N DLA clusters | LUTs [%] | BRAM [%]  | URAM [%] |
|----------------|----------|-----------|----------|
| 0              | 2.72     | **28.10** | **0.21** |
| 1              | 21.61    | 28.96     | 6.88     |
| 2              | 29.95    | 43.70     | 13.33    |

Slide annotations on the 0-cluster row: "readout buffers" (BRAM), "needed … DLA" (URAM).

| N DLA clusters | Inference rate | Average latency / muon inference | Encoding                   |
|----------------|----------------|----------------------------------|----------------------------|
| 4 clusters     | **5.2 MHz**    | **192 ns**                       | Q8.8                       |
| 2 clusters     | **2.6 MHz**    | **385 ns**                       | Variable Fixed Point (VFP) |

Not yet able to fit 4 clusters w/ VFP
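The two columns of the throughput table appear to be related by a simple reciprocal: the quoted average latency per muon inference matches the mean spacing between inferences at the quoted rate. This reading is an interpretation of the numbers, not something stated on the slide; the helper name below is illustrative.

```python
def mean_interval_ns(rate_hz):
    """Mean time between inferences, in ns, at a given inference rate."""
    return 1e9 / rate_hz


# (clusters, inference rate in Hz, quoted latency in ns) from the table above.
QUOTED = [(4, 5.2e6, 192), (2, 2.6e6, 385)]

for clusters, rate_hz, quoted_ns in QUOTED:
    interval = mean_interval_ns(rate_hz)
    print(f"{clusters} clusters: 1 / {rate_hz / 1e6:.1f} MHz = {interval:.1f} ns "
          f"(quoted: {quoted_ns} ns)")
```

If this reading is right, the figure describes sustained throughput, and doubling the number of clusters halves the mean interval.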



[Figure: FPGA floorplan (clock regions X0Y0 to X5Y14, SLR0 to SLR2): SB-852 infrastructure + L1 scouting firmware]

[Figure: FPGA floorplan: SB-852 infrastructure + L1 scouting firmware + 2 clusters of MDLA]







### 40 MHz scouting w/ VCU128

- (4 + 6 w/ mezzanine) QSFPs & HBM
- Replace DMA w/ TCP/IP to surface
- Replace FIFO chain w/ HBM
- DMA data-taking also supported





[Diagram: VCU128 scouting firmware, with legend. Labels include **GTY** input emulation logic and a 250 MHz HBM clock]





### VCU128 - NN w/ hls4ml

- Integrated NN for muon recalibration generated w/ hls4ml\*
- Q6.12 precision, pruning factor 0.5
- 2 NNs, each processing 4 muons / BX
- Latency $\lesssim$ 100 ns FIFO latency, can accept 2 muons / clock



\*Python API & command line tool that translates trained NNs to synthesizable FPGA firmware

https://fastmachinelearning.org/hls4ml/ https://arxiv.org/abs/1804.06913
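To illustrate what a pruning factor of 0.5 could mean, the sketch below zeroes the half of a layer's weights with the smallest magnitude. This is an assumption about the term: the slide does not say how the pruning was done, and in practice pruning is usually applied gradually during training rather than in one shot. The function name and the example weights are hypothetical, and no hls4ml API is used.

```python
def prune_by_magnitude(weights, fraction=0.5):
    """Return a copy of `weights` (a list of rows) in which the `fraction`
    of entries with the smallest absolute value is set to zero."""
    # Rank every entry by magnitude, remembering its position.
    ranked = sorted(
        (abs(w), r, c)
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
    )
    n_prune = int(len(ranked) * fraction)
    pruned = [list(row) for row in weights]
    for _, r, c in ranked[:n_prune]:
        pruned[r][c] = 0.0
    return pruned


# Hypothetical 2 x 4 weight matrix of one dense layer.
layer = [
    [0.8, -0.05, 0.3, 0.01],
    [-0.6, 0.02, -0.4, 0.1],
]
pruned = prune_by_magnitude(layer, fraction=0.5)
for row in pruned:
    print(row)
```

Zero weights are attractive in firmware because a multiplication by a constant zero does not need to be implemented, which reduces the resources taken by the network.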











### **Plans for CMS Phase 2**

### **New L1 trigger for CMS at HL-LHC**

### L1 scouting will have stageable architecture



1. GT inputs & outputs (sDS)
2. Calo & Muon local reco (sLS)
3. Tracker tracks (sTS)
4. Calo primitives (sPS)



### Summary

- L1 Scouting demonstrator system in operation, taking data from $\mu GMT$ and CALO trigger Layer 2
- Three FPGA boards: Xilinx KCU1500, VCU128 & Micron SB-852
- Applying ML inference w/ help of Micron DLA framework and/or hls4ml
  - » for re-calibration of parameters and
  - » fake detection
  - » w/ real performance gains
- Full system in development w/ DAQ800 board for CMS at HL-LHC

















### **Backup: Fake muon pair classifier**

- Network consists of 8 recalibration branches & 4 classification branches
- Trained/tested with Run 3 Zero-bias data



[Diagram: classification branch. Inputs: p<sub>T</sub>, Qual and *q* for each muon of the pair. Layers: Dense + BN + ReLU (28 nodes), Dense + BN + ReLU (12 nodes), Dense + BN + ReLU (20 nodes), Dense + BN + Sigmoid (1 node), giving the class output]

| Inputs used | … false positive: … curve |
|-------------|---------------------------|
| I - only    | 89.2%                     |
| p - only    | 97.4%                     |
| p - only    | 97.7%                     |
| All         | 97.2%                     |











