

### **Dylan Rankin [UPenn] SmartHEP Edge Machine Learning School** September 23rd, 2024

# experiments



### Introduction

- ML is becoming more and more popular; HEP and the LHC are no exception
- ML in HEP is trending towards bigger and more complicated models, and more computing
- Availability of CPUs, GPUs, modern software has accelerated adoption



arXiv: 2203.15823

arXiv: 2403.05618



### HEP ML

- Majority of ML in physics is "off detector"
  - System latency/resource limits are typically soft (if they exist at all)
  - No radiation
  - Issues do not impact data collection
    - Can re-run algorithms/workflows





### What if...

- What if:
  - System latency/resource limits are tight?
  - High radiation?
  - No undo button?
- Requires dedicated hardware, strategies
- $\rightarrow$  Edge ML
  - "Edge ML is the process of running ML algorithms on computing devices at the periphery of a network to make decisions and predictions as close as possible to the originating source of data."
  - Placing ML at sensors, running in real-time



### Edge ML

- Want to focus on two main components of edge ML:
- Specialized hardware → specialized tools
  - FPGAs, ASICs, GPUs (and more)
- Data source, format must be considered to be effective
  - How does data get to specialized device?
  - Does it arrive all at once?
  - Does it come with the features I want already?
  - Are there limitations from the environment/device/task?



## Large Hadron Collider (LHC)

Figure: the LHC ring and its experiments (labels include LHCb, ALICE, CMS); bunches of $10^{11}$ protons cross every 25 ns (40 MHz).





- Level-1 Trigger (FPGAs, ASICs) O(μs) hard latency
- High Level Trigger (CPUs, GPUs, FPGAs?) O(100 ms) soft latency
- Offline (CPUs, GPUs)  $\rightarrow$  1 s latencies


If we don't identify interesting events in the trigger, we lose them forever!
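To make these timescales concrete, here is a minimal sketch that converts the 40 MHz collision rate into a bunch-crossing period and counts how many crossings happen while one decision is pending. The latency values (4 µs, 100 ms) are assumed stand-ins for the order-of-magnitude figures above, not quoted numbers.

```python
# Back-of-the-envelope trigger timing; the latency values are illustrative assumptions.
BUNCH_CROSSING_RATE_HZ = 40e6  # 40 MHz collision rate


def crossing_period_ns(rate_hz: float) -> float:
    """Time between consecutive bunch crossings, in nanoseconds."""
    return 1e9 / rate_hz


def crossings_in_flight(latency_s: float, rate_hz: float = BUNCH_CROSSING_RATE_HZ) -> int:
    """Number of bunch crossings that occur while one decision is pending."""
    return round(latency_s * rate_hz)


print(f"period: {crossing_period_ns(BUNCH_CROSSING_RATE_HZ):.0f} ns")
for name, latency_s in [("Level-1 (assumed 4 us)", 4e-6), ("HLT (assumed 100 ms)", 100e-3)]:
    print(f"{name}: {crossings_in_flight(latency_s)} crossings in flight")
```

Under these assumptions a Level-1 decision spans on the order of a hundred crossings, which is why the data must be pipelined rather than processed one event at a time.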








### What is an FPGA?

- Field-programmable gate array
- Building blocks:
  - Multiplier units (DSPs) [arithmetic]
  - Look Up Tables (LUTs) [logic]
  - Flip-flops (FFs) [registers]
  - Block RAMs (BRAMs) [memory]
- Algorithms are wired onto the chip
  - Can only use the resources on the chip
- Run at high frequency: hundreds of MHz, O(ns) runtime





### Inference on FPGAs

- Each part of network must be placed on the FPGA, connected together
- Cannot implement an algorithm if there are no resources left
  - Cannot just run things slower (25 ns!)













## Many Tools (Tutorials this week)

- NNs: arXiv: 1804.06913
- Boosted Decision Trees (BDTs): arXiv: 2002.02534

- Different tools have different methodology, target different designs/problems
- Entirely non-exhaustive list...



arXiv: 2004.03021



arXiv: 2104.03408

## **ML Size / Complexity**

- Regardless of toolkit, a basic limitation of edge ML is device size
  - Bigger device $\rightarrow$ more resources $\rightarrow$ more computation $\rightarrow$ larger ML models
- Example device: Xilinx Virtex UltraScale+ VU13P with 12288 multipliers, 1.7M LUTs, 3.4M FFs, 95 Mb BRAM

- Alternatively, is it possible to reduce network size without hurting performance?
  - Pruning and quantization are two potential ways
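A rough sketch of why device size limits model size: count the multiplications in a dense network and compare with the 12288 multipliers quoted above. The one-multiplier-per-multiplication rule and both layer configurations are simplifying assumptions for illustration; real designs may map multiplications to resources differently.

```python
# Rough multiplier budget for a fully parallel dense network (illustrative only).
VU13P_MULTIPLIERS = 12288  # multiplier count quoted for the VU13P


def dense_multiplications(layer_sizes):
    """Count weight multiplications in a fully connected network.

    layer_sizes lists the width of every layer, inputs first.
    """
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def fits_fully_parallel(layer_sizes, multipliers=VU13P_MULTIPLIERS):
    """True if one multiplier per multiplication fits on the device (an assumption)."""
    return dense_multiplications(layer_sizes) <= multipliers


small = [16, 64, 32, 32, 5]   # hypothetical small model
large = [784, 256, 128, 10]   # hypothetical larger model
print(dense_multiplications(small), fits_fully_parallel(small))
print(dense_multiplications(large), fits_fully_parallel(large))
```

In this toy accounting the small model fits with room to spare, while the larger one exceeds the budget by more than an order of magnitude.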







- Are all the pieces of a given network necessary?
- Many different types of pruning
- Magnitude-based:
  - Use regularization (penalty term in loss function) for large weights
  - Remove smallest weights
  - Repeat
- Multiplications by 0 can be completely removed from FPGA design
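The regularize, remove, repeat loop can be sketched on a toy linear model. Everything here (the data, the L1 penalty strength, the 30% pruning fraction, four rounds) is an assumption chosen for illustration, not the recipe used for any real trigger network.

```python
import numpy as np


def train_l1(X, y, w, mask, l1=0.01, lr=0.05, steps=500):
    """Gradient descent on MSE + L1 penalty; pruned weights (mask == 0) stay at zero."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y) + l1 * np.sign(w)
        w = (w - lr * grad) * mask
    return w


def prune_smallest(w, mask, fraction):
    """Zero out the given fraction of the surviving weights with the smallest magnitude."""
    alive = np.flatnonzero(mask)
    n_prune = int(fraction * alive.size)
    drop = alive[np.argsort(np.abs(w[alive]))[:n_prune]]
    mask = mask.copy()
    mask[drop] = 0.0
    return w * mask, mask


rng = np.random.default_rng(0)
X = rng.normal(size=(512, 20))
true_w = np.zeros(20)
true_w[:4] = [2.0, -1.5, 1.0, 0.5]        # only 4 inputs matter in this toy problem
y = X @ true_w + 0.01 * rng.normal(size=512)

w, mask = np.zeros(20), np.ones(20)
for _ in range(4):                         # train -> prune -> repeat
    w = train_l1(X, y, w, mask)
    w, mask = prune_smallest(w, mask, fraction=0.3)
w = train_l1(X, y, w, mask)                # final fine-tune of the surviving weights

print("surviving weights:", int(mask.sum()), "of", mask.size)
```

Each weight whose mask entry is zero corresponds to a multiplication that never needs to be placed on the chip.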



Figure: plot with x-axis "Absolute Relative Weights" on a log scale from $10^{-7}$ to $10^{0}$.










- FPGAs are well suited to fixed-point numbers, not floating point
- Bitwidth can be adjusted as needed (impacts accuracy, performance, resources)
  - Can be combined with other customizations
- Quantization-aware training [arXiv:2006.10159]
  - Can greatly reduce size of network by training with knowledge of quantization
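A minimal sketch of what fixed-point quantization does to a value, loosely in the style of an `ap_fixed<W, I>` type. Round-to-nearest and saturation are choices made for this sketch, and the bit widths are assumed examples; this is post-training rounding, not quantization-aware training.

```python
import numpy as np


def to_fixed_point(x, total_bits=8, int_bits=3):
    """Round x onto a signed fixed-point grid, saturating at the ends.

    total_bits includes the sign bit; int_bits counts the bits to the left
    of the binary point (sign included).
    """
    frac_bits = total_bits - int_bits
    scale = 2.0 ** frac_bits
    lo = -(2 ** (total_bits - 1))          # most negative integer code
    hi = 2 ** (total_bits - 1) - 1         # most positive integer code
    codes = np.clip(np.round(np.asarray(x) * scale), lo, hi)
    return codes / scale


weights = np.linspace(-1.0, 1.0, 101)
for bits in (4, 8, 16):
    err = np.max(np.abs(weights - to_fixed_point(weights, total_bits=bits, int_bits=2)))
    print(f"{bits:2d} bits: max error {err:.6f}")
```

The printed error shrinks as the bit width grows, which is the accuracy versus resources trade-off described above.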








### LHC Applications



### Particle Identification

- LHC triggers must distinguish particular collections of particles / detector signals from overwhelming backgrounds
  - Signals: τ leptons, bottom quarks
  - Backgrounds: light quarks, gluons, noise, combinatorics
- Edge ML can enable this faster / better



## L1 τ Identification

- NN algorithm capable of accepting more τ leptons than the traditional cut-based method
- Network is a 3-layer dense model that uses information about particle $p_T$, $\eta$, $\phi$, and type
- Outputs a decision in 38 ns (9 clocks @ 240 MHz)
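As a quick check of the quoted latency (plain arithmetic, using only the numbers on this slide and the 25 ns bunch spacing):

$$
t = \frac{9\ \text{clocks}}{240\ \text{MHz}} = 37.5\ \text{ns} \approx 38\ \text{ns}, \qquad \frac{37.5\ \text{ns}}{25\ \text{ns}} = 1.5\ \text{bunch crossings}
$$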



![](_page_26_Figure_5.jpeg)

![](_page_26_Figure_6.jpeg)

![](_page_26_Figure_7.jpeg)

CMS TDR-021

## L1 b-quark Identification

- NN trained to identify b-quarks using a collection of particles
- Architecture includes featurizers that act on each particle individually
- Significantly improved acceptance for HH→bbbb events with low $m_{HH}$ (compared to traditional cut-based methods)

Figure labels: per-particle input features, 5 features/particle, 50 features

![](_page_27_Figure_4.jpeg)

Figure labels: pointwise convolution (per-particle dense layer); output is 1 feature (b-tag score)
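A small sketch of what "pointwise convolution (per-particle dense layer)" means: one set of weights applied to every particle independently. The sizes are hypothetical; 10 particles is an assumption chosen so that 5 features per particle flatten to the 50 features in the figure labels.

```python
import numpy as np


def pointwise_dense(particles, weights, bias):
    """Apply one shared dense layer (with ReLU) to every particle independently.

    particles: (n_particles, n_in), weights: (n_in, n_out), bias: (n_out,)
    """
    return np.maximum(particles @ weights + bias, 0.0)


rng = np.random.default_rng(0)
n_particles, n_in, n_out = 10, 6, 5            # hypothetical sizes
particles = rng.normal(size=(n_particles, n_in))
W, b = rng.normal(size=(n_in, n_out)), rng.normal(size=n_out)

features = pointwise_dense(particles, W, b)     # shape (10, 5)

# Same result as looping over particles one at a time with the same weights.
looped = np.stack([np.maximum(p @ W + b, 0.0) for p in particles])
assert np.allclose(features, looped)

# Reordering the particles just reorders the rows of the output.
perm = rng.permutation(n_particles)
assert np.allclose(pointwise_dense(particles[perm], W, b), features[perm])

flat = features.reshape(-1)                     # 10 particles x 5 features = 50 values
print(features.shape, flat.shape)
```

Because the weights are shared across particles, the featurizer's parameter count does not grow with the number of particles.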

![](_page_27_Figure_7.jpeg)


![](_page_27_Figure_8.jpeg)

### **L1 Electron Identification**

- Electrons are complex signatures
  - Multiple sub detectors (tracker & calorimeter)
  - Undergo bremsstrahlung ( $e \rightarrow e + \gamma$ )
- Edge ML well-suited to electron ID
  - Handles correlations between different inputs
  - 5-10% improvement in plateau efficiency
- Important for many different physics signatures

![](_page_28_Figure_8.jpeg)

![](_page_28_Picture_10.jpeg)

![](_page_28_Picture_11.jpeg)

![](_page_28_Picture_12.jpeg)

![](_page_29_Figure_4.jpeg)

![](_page_29_Picture_5.jpeg)

![](_page_30_Figure_4.jpeg)

![](_page_30_Figure_6.jpeg)

![](_page_30_Figure_11.jpeg)

![](_page_30_Picture_13.jpeg)

![](_page_30_Picture_15.jpeg)

![](_page_30_Picture_16.jpeg)

![](_page_30_Picture_17.jpeg)

![](_page_30_Picture_18.jpeg)

- What if we don't know exactly what new physics looks like?
  - $\rightarrow$  anomaly detection (AD)
- Can reduce network size by removing the decoder and using the latent space directly (allows latency below 50 ns)

Train on ZeroBias LHC data

![](_page_31_Figure_5.jpeg)

![](_page_31_Figure_7.jpeg)

Bottleneck: the autoencoder learns to compress high-dimensional inputs into a low-dimensional latent space

$x - \hat{x}$ represents the degree of abnormality
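The reconstruction-error idea can be sketched with a linear autoencoder fitted by PCA. PCA is a stand-in chosen to keep the example short, not the network deployed in the trigger, and the toy data and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "background": 8-dimensional inputs that really live near a 2-dimensional subspace.
basis = rng.normal(size=(2, 8))
background = rng.normal(size=(2000, 2)) @ basis + 0.05 * rng.normal(size=(2000, 8))

# Linear autoencoder fitted with PCA (stand-in for a trained NN autoencoder).
mean = background.mean(axis=0)
_, _, vt = np.linalg.svd(background - mean, full_matrices=False)
components = vt[:2]                       # encoder weights; the decoder is the transpose


def encode(x):
    """Compress inputs into the 2-dimensional latent space (the bottleneck)."""
    return (x - mean) @ components.T


def decode(z):
    """Map latent values back to the 8-dimensional input space."""
    return z @ components + mean


def anomaly_score(x):
    """Mean squared reconstruction error, i.e. the size of x - x_hat."""
    x_hat = decode(encode(x))
    return np.mean((x - x_hat) ** 2, axis=-1)


typical = background[:5]
unusual = typical + 3.0 * rng.normal(size=typical.shape)   # pushed off the learned subspace
print("typical:", anomaly_score(typical).round(4))
print("unusual:", anomaly_score(unusual).round(4))
```

Inputs resembling the training data reconstruct well and score low; inputs unlike it reconstruct poorly and score high, without any signal model having been specified.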

![](_page_31_Figure_11.jpeg)

T. Aarrestad, CMS ML Townhall

![](_page_31_Figure_13.jpeg)


![](_page_31_Picture_15.jpeg)

![](_page_31_Figure_16.jpeg)

![](_page_31_Figure_17.jpeg)

- What if we don't know exactly what new physics looks like?
  - $\rightarrow$  anomaly detection (AD)
- Can reduce network size by training a student network to predict the teacher network's MSE

![](_page_32_Figure_4.jpeg)

![](_page_32_Figure_8.jpeg)

![](_page_32_Picture_10.jpeg)

![](_page_32_Picture_11.jpeg)

![](_page_32_Picture_12.jpeg)

## L1 AD Trigger

- CMS has already deployed multiple AD algorithms in trigger
  - AXOL1TL [CMS DP-2023/079, CMS DP-2024/059] & CICADA [CMS DP-2023/086]
- Currently collecting interesting events that would have been missed
  - Network preferentially identifies large multiplicity events
  - Potentially large gains in new physics acceptance

![](_page_33_Picture_6.jpeg)

![](_page_33_Figure_7.jpeg)

- detector in the first place?!
- ASIC, logic triplicated) [2105.01683]

![](_page_34_Picture_5.jpeg)

## LAr Peak Finding

- ATLAS LAr calorimeter needs to measure time and energy of pulses
  - Overlapping pulses difficult for simple, fast algorithms to handle (150 ns = 6 BXs)
- CNN and LSTM architectures both able to significantly improve performance
  - Well-suited for data structure, able to account for non-linear correlations

![](_page_35_Figure_5.jpeg)

![](_page_35_Figure_6.jpeg)

![](_page_35_Figure_7.jpeg)

![](_page_35_Picture_10.jpeg)

![](_page_35_Picture_11.jpeg)


![](_page_36_Figure_5.jpeg)

![](_page_36_Figure_6.jpeg)

![](_page_36_Figure_8.jpeg)

![](_page_37_Figure_1.jpeg)


### LHC Data Processing / Readout

- Level-1 Trigger
- **High Level Trigger**
- Offline

![](_page_38_Picture_5.jpeg)


![](_page_38_Picture_8.jpeg)

## **HLT b-tagging**

- Early usage of ML at LHC for b-tagging
  - High complexity, physics motivation  $\rightarrow$  significant ML gains
- Algorithms have to evolve quickly to keep up with modern ML
  - BDTs  $\rightarrow$  MLPs & DeepSets  $\rightarrow$  GNNs (+ attention)
- Tiered reconstruction/filtering allows running computationally intensive algorithms in trigger

![](_page_39_Figure_6.jpeg)

![](_page_39_Figure_7.jpeg)

### **ATLAS BJet Trigger Public Results**

![](_page_39_Figure_12.jpeg)

## **GNN Tracking**

- Tracking is an incredibly hard problem, tracking in HLT even harder
  - Huge combinatorics, only going to get worse
- GNNs show promise for HL-LHC

![](_page_40_Figure_6.jpeg)

![](_page_40_Picture_7.jpeg)

ATL-COM-DAQ-2024-004


![](_page_41_Picture_5.jpeg)


![](_page_42_Picture_0.jpeg)

![](_page_42_Picture_3.jpeg)

## **GNN Tracking (LHCb)**

- LHCb must do tracking at 30 MHz
  - Exploring use of GNNs [2407.12119]
- Demonstrated good performance possible
- Achieving necessary throughput is a challenging problem

![](_page_43_Figure_5.jpeg)

![](_page_43_Figure_6.jpeg)

### Lipschitz Monotonic NN

- On-detector ML is not just about speed
  - Robustness and understandability are also very important
- Networks can be made provably monotonic [2112.00038]
- LHCb has used this technique to design NNs for use in HLT
  - E.g. smooth dependence on flight distance for heavy-flavor decays
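A toy sketch of one way to make monotonicity provable: bound the slope of a flexible function by a constant, then add a linear term with that same slope so the total derivative can never go negative. This construction and all names in it are assumptions for illustration; the networks of [2112.00038] are not reproduced here.

```python
import numpy as np

LAMBDA = 1.0  # slope bound (hypothetical value)


def lipschitz_part(x, w, b):
    """A toy function whose slope in every input is bounded by LAMBDA.

    w is rescaled so that max|w_i| <= 1 and tanh has slope <= 1,
    hence |dg/dx_i| <= LAMBDA for every input i.
    """
    w = w / max(1.0, np.max(np.abs(w)))
    return LAMBDA * np.tanh(x @ w + b)


def monotonic_model(x, w, b):
    """Add LAMBDA * sum(x) so the total slope in each input is >= 0."""
    return lipschitz_part(x, w, b) + LAMBDA * np.sum(x, axis=-1)


rng = np.random.default_rng(0)
w, b = 5.0 * rng.normal(size=3), 0.1       # arbitrary, deliberately large weights
x = rng.normal(size=(200, 3))

# Increasing any single input never decreases the output.
for i in range(3):
    step = np.zeros(3)
    step[i] = 0.5
    change = monotonic_model(x + step, w, b) - monotonic_model(x, w, b)
    print(f"input {i}: smallest change = {change.min():.4f}")
```

The guarantee holds for any weights, so it does not depend on how well the training went.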

![](_page_44_Figure_6.jpeg)

![](_page_44_Figure_7.jpeg)

![](_page_44_Figure_10.jpeg)

### **Continual Learning**

- On-detector ML has no re-do button
  - Cannot just reprocess with new network if conditions change
- Continual learning method uses mix of original and new data to retrain model
  - Better performance than simple retraining (or no retraining)
- Important consideration especially when conditions can change significantly
- Example from CMS considers degradations in L1 tracking
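The "mix of original and new data" step can be sketched as building a retraining set that replays a sample of the original data alongside the new data. The `replay_fraction` knob and the toy arrays are hypothetical; the slide does not quote a mixing ratio.

```python
import numpy as np


def mixed_training_set(original, new, replay_fraction, rng):
    """Combine all new samples with a random subset of the original ones.

    replay_fraction is the share of the returned set drawn from `original`
    (a hypothetical knob for this sketch).
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    n_replay = int(round(len(new) * replay_fraction / (1.0 - replay_fraction)))
    n_replay = min(n_replay, len(original))
    picked = rng.choice(len(original), size=n_replay, replace=False)
    mixed = np.concatenate([original[picked], new])
    rng.shuffle(mixed)                      # shuffle rows in place
    return mixed


rng = np.random.default_rng(0)
original = rng.normal(0.0, 1.0, size=(1000, 3))   # data from the original conditions
new = rng.normal(0.5, 1.0, size=(300, 3))         # data after conditions shifted
mixed = mixed_training_set(original, new, replay_fraction=0.5, rng=rng)
print(mixed.shape)
```

Retraining on `mixed` rather than on `new` alone is what keeps the model from forgetting the original conditions.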

![](_page_45_Figure_12.jpeg)

CMS DP-2023/022

![](_page_45_Picture_14.jpeg)

![](_page_46_Figure_12.jpeg)

![](_page_46_Picture_14.jpeg)

![](_page_47_Figure_12.jpeg)

![](_page_47_Picture_14.jpeg)

### Conclusions

- Increasingly possible and necessary to perform real time edge ML in LHC experiments
  - FPGAs and GPUs are the main hardware tools, but not the only ones!
- ML offers improved performance over traditional algorithms
  - Together with advances in off-detector ML, this brings better alignment of offline and online algorithms
- Applications in many other fields, areas too!

![](_page_48_Figure_6.jpeg)

![](_page_48_Picture_7.jpeg)

### BACKUP

![](_page_49_Picture_2.jpeg)

## A Toroidal LHC ApparatuS (ATLAS)

![](_page_50_Figure_1.jpeg)

![](_page_50_Picture_2.jpeg)

### **ATLAS Slice**

![](_page_51_Figure_2.jpeg)

![](_page_51_Picture_3.jpeg)


![](_page_52_Figure_2.jpeg)

![](_page_52_Picture_3.jpeg)

![](_page_52_Picture_7.jpeg)

![](_page_53_Figure_0.jpeg)

![](_page_53_Picture_3.jpeg)

### L1 AD

![](_page_54_Picture_1.jpeg)

CMS Experiment at the LHC, CERN Data recorded: 2023-May-24 01:42:17.826112 GMT Run / Event / LS: 367883 / 374187302 / 159

![](_page_54_Picture_3.jpeg)

![](_page_54_Picture_4.jpeg)

Jannicke Pearkes

![](_page_54_Picture_6.jpeg)

### Reuse

- For lowest latency, compute all multiplications at once
  - Reuse = 1 (fully parallel) → latency = # layers
- Larger reuse implies more serialization
- Allows trading higher latency for lower resource usage
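The trade-off can be sketched with a toy model in which each multiplier is used `reuse` times and a layer takes about `reuse` clock cycles. Both rules and the layer sizes are simplifying assumptions; real tools have additional overheads.

```python
import math


def multipliers_needed(n_in, n_out, reuse):
    """Multipliers for one dense layer when each multiplier is used `reuse` times."""
    return math.ceil(n_in * n_out / reuse)


def network_latency_clocks(n_layers, reuse):
    """Toy latency model: each layer takes about `reuse` clocks (1 when fully parallel)."""
    return n_layers * reuse


n_in, n_out, n_layers = 64, 32, 3          # hypothetical layer and network size
for reuse in (1, 4, 32):
    print(
        f"reuse={reuse:2d}: {multipliers_needed(n_in, n_out, reuse):4d} multipliers per layer, "
        f"~{network_latency_clocks(n_layers, reuse)} clocks"
    )
```

With reuse = 1 the latency equals the number of layers, matching the bullet above; raising the reuse factor cuts the multiplier count proportionally while the latency grows.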

![](_page_55_Picture_5.jpeg)

![](_page_55_Figure_6.jpeg)

![](_page_55_Figure_7.jpeg)

![](_page_55_Picture_8.jpeg)

![](_page_56_Figure_0.jpeg)

## Heterogeneous Computing

- Direct connect
  - Simple connections
  - Reduced network load
- As-a-service (aaS)
  - Simple support for mixed hardware
  - Scalable
  - Throughput optimizations for multiple clients
  - Simple client-side

![](_page_57_Figure_9.jpeg)

![](_page_57_Figure_10.jpeg)

![](_page_57_Picture_11.jpeg)

### As-a-service computing

 Biggest gains come from algorithms that are faster to run on accelerator, workflows that can be parallelized

![](_page_58_Figure_3.jpeg)

![](_page_58_Picture_5.jpeg)


![](_page_59_Figure_2.jpeg)

### **Processor as-a-Service**

![](_page_59_Picture_5.jpeg)