## **FWXMACHINA**

Nanosecond machine learning with boosted decision trees for high energy physics





Tae Min Hong

- Paper JINST 16 P08016 (2021)
- Info <a href="http://fwx.pitt.edu">http://fwx.pitt.edu</a>
- Code <a href="http://gitlab.com/PittHongGroup/fwX">http://gitlab.com/PittHongGroup/fwX</a>

PIKIMO 2021

December 4, 2021

<a href="https://indico.cern.ch/event/1091676/">https://indico.cern.ch/event/1091676/</a>

# Machine learning at <u>L1 trigger</u>









## **BDT** design

- Algorithm structure
- Firmware design

## Results

- VBF Higgs vs. multijet
- (Electrons vs. photons)

## Comparisons

- vs. hls4ml's BDT
- vs. hls4ml's neural network

JINST 16 P08016 (2021)

Not in paper





I will focus on the first two steps



#### **Conventional tree structure**



#### 2d plane: x<sub>a</sub> vs. x<sub>b</sub>



#### Normal decision tree

Recursive in the tree depth

#### **Firmware**

Not our design

#### TM Hong

# Flattened decision tree







#### 2d plane: x<sub>a</sub> vs. x<sub>b</sub>



#### Flattened decision tree

Axes are independent → Bin search problem on a grid

#### **Firmware**

Our design
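To make the "bin search on a grid" idea concrete, here is a minimal Python sketch, not the fwX firmware. The tree, its thresholds, and all function names are hypothetical. It assumes a cut sends `x < threshold` to the left branch, collects the cuts used on each axis, and pre-computes one score per grid cell so that evaluation becomes an independent bin search per axis followed by a table lookup.

```python
from bisect import bisect_right
from itertools import product

# Hypothetical depth-2 tree over (x_a, x_b); every number here is made up.
# Internal node: (variable index, threshold, left subtree, right subtree).
# Leaf: a plain float score.
TREE = (0, 5.0,
        (1, 3.0, -1.0, 0.5),   # x_a < 5, then cut on x_b at 3
        (1, 7.0, 0.2, 1.0))    # x_a >= 5, then cut on x_b at 7


def eval_recursive(node, x):
    """Conventional evaluation: one comparison per level of depth."""
    if not isinstance(node, tuple):
        return node
    var, thr, left, right = node
    return eval_recursive(left if x[var] < thr else right, x)


def collect_cuts(node, cuts):
    """Gather the thresholds used on each variable anywhere in the tree."""
    if isinstance(node, tuple):
        var, thr, left, right = node
        cuts[var].add(thr)
        collect_cuts(left, cuts)
        collect_cuts(right, cuts)
    return cuts


def representative(cuts, i):
    """A point inside bin i of one axis (the lower edge belongs to the bin)."""
    if not cuts:
        return 0.0
    return cuts[0] - 1.0 if i == 0 else cuts[i - 1]


def flatten(tree, n_vars):
    """Per-axis sorted cut lists plus one pre-computed score per grid cell."""
    cuts = [sorted(c) for c in collect_cuts(tree, [set() for _ in range(n_vars)])]
    table = {}
    for cell in product(*(range(len(c) + 1) for c in cuts)):
        point = [representative(c, i) for c, i in zip(cuts, cell)]
        table[cell] = eval_recursive(tree, point)
    return cuts, table


def eval_flat(cuts, table, x):
    """Each axis is searched independently; the cell index selects the score."""
    cell = tuple(bisect_right(c, v) for c, v in zip(cuts, x))
    return table[cell]


CUTS, TABLE = flatten(TREE, n_vars=2)
```

In this sketch the per-axis searches do not depend on one another, which is the property the slide points to; the price is a table with one entry per grid cell.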

# Pre-merge the forest





#### Pre-merging trees

- Pre-processed in software before implementation in firmware
- No impact on physics performance

#### **Firmware**

Our look-up table design
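The pre-merging step can be sketched in software as follows. This is an illustration under simplifying assumptions, not the fwX Tree Merger: each flattened tree is reduced to one variable and stored as a hypothetical pair `(cuts, scores)` with `len(scores) == len(cuts) + 1`. Merging takes the union of all cuts and sums the scores bin by bin, so the merged table returns the same value as evaluating every tree and adding the results.

```python
from bisect import bisect_right


def evaluate(cuts, scores, x):
    """Piecewise-constant lookup: bin index = number of cuts <= x."""
    return scores[bisect_right(cuts, x)]


def merge(trees):
    """Merge several flattened 1d trees into a single (cuts, scores) table."""
    merged_cuts = sorted({c for cuts, _ in trees for c in cuts})
    # One representative point per merged bin: below the first cut,
    # then the lower edge of every later bin.
    reps = [merged_cuts[0] - 1.0] + merged_cuts if merged_cuts else [0.0]
    merged_scores = [sum(evaluate(c, s, r) for c, s in trees) for r in reps]
    return merged_cuts, merged_scores


# Three made-up flattened trees on the same variable.
TREES = [
    ([2.0], [-1.0, 1.0]),
    ([5.0], [-0.5, 0.5]),
    ([2.0, 7.0], [0.0, 0.25, 1.0]),
]
MERGED_CUTS, MERGED_SCORES = merge(TREES)
```

Because the merged scores are exact sums of the original ones, the classifier output is unchanged, which is consistent with the "no impact on physics performance" statement; only the number of tables to look up is reduced.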

# Motivation: VBF Higgs vs. multijet





#### Same production, two decays

- H → νννν, "invisible"
- H → bbbb, through pseudoscalars

#### Strategy

- Train on multijet vs. VBF H → νννν
- Apply to multijet vs. VBF H → bbbb

#### Why

- Can trigger on VBF Higgs → anything
- Does it work? Yes, next slide



## Results

It works!

#### Performance

- Efficiency: 2x better vs. HL-LHC ATLAS
- Latency: 16 ns = 5 clock ticks

#### **Details**

Validation: Eff. matches ATLAS Run-2 paper



## **BDT vs. neural networks**

Step function for 1d



Neural network 2d



Decision tree 2d



## **Activation function for NN**

Fuzzy boundary using a turn-on function





## Forest of decision trees





# **Fuzzy boundaries**
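One way to read the comparison with the neural-network turn-on function is sketched below, under assumptions of our own: a single tree gives a hard step in one variable, while the average of many steps with slightly different thresholds rises gradually, much like a sigmoid. The thresholds and the sigmoid width are made up for illustration.

```python
import math


def step(x, threshold):
    """Hard boundary of a single one-cut tree."""
    return 1.0 if x >= threshold else 0.0


def forest_average(x, thresholds):
    """Average of many steps: a staircase that approximates a smooth turn-on."""
    return sum(step(x, t) for t in thresholds) / len(thresholds)


def sigmoid(x, center, width):
    """Turn-on function of the kind used as a neural-network activation."""
    return 1.0 / (1.0 + math.exp(-(x - center) / width))


# Nine hypothetical thresholds spread evenly between 4.0 and 6.0.
THRESHOLDS = [4.0 + 0.25 * i for i in range(9)]

for x in (3.0, 4.5, 5.0, 5.5, 7.0):
    print(x, round(forest_average(x, THRESHOLDS), 3), round(sigmoid(x, 5.0, 0.5), 3))
```

Both columns rise from 0 to 1 across the same region; the forest does so in discrete steps whose granularity is set by the number of trees.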





## **Data format**

We encode in N-bit integers
E.g., ap\_int⟨3⟩ means the range 0-7

Advantages

Represents a variety of precisions, e.g., $p_T$ from GeV to TeV vs. $\varphi$ from 0 to $2\pi$

Subtleties

Transformation adds 1-bit ambiguity

$$c_{\text{int}} = f(c_{\text{float}}) = \left\lfloor \frac{c_{\text{float}} - c_{\text{min}}}{c_{\text{max}} - c_{\text{min}}} \cdot \left( 2^N - 1 \right) \right\rfloor$$

Equal up to one bit because of floor

$$f(x_1 + x_2) = f(x_1) + f(x_2)$$

Firmware adds the pre-evaluated f(x)
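The transformation and its one-bit ambiguity can be checked with a short sketch. The function name `to_int` and the numerical values are illustrative assumptions; the additivity check further assumes $c_{\text{min}} = 0$, since a non-zero offset would be counted twice on the right-hand side.

```python
import math


def to_int(c, c_min, c_max, n_bits):
    """Map a float in [c_min, c_max] to an N-bit integer using the floor."""
    return math.floor((c - c_min) / (c_max - c_min) * (2 ** n_bits - 1))


# Illustrative 3-bit encoding of the range [0, 4].
N_BITS, C_MIN, C_MAX = 3, 0.0, 4.0

x1, x2 = 1.0, 1.0
lhs = to_int(x1 + x2, C_MIN, C_MAX, N_BITS)                           # f(x1 + x2)
rhs = to_int(x1, C_MIN, C_MAX, N_BITS) + to_int(x2, C_MIN, C_MAX, N_BITS)  # f(x1) + f(x2)
print(lhs, rhs)  # prints "3 2": the two sides differ by one unit
```

The difference arises because each floor discards a fraction below one unit, so the sum of two floored values can fall short of the floored sum by at most 1.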

hls4ml encodes in fixed point

E.g., ap\_fixed⟨5,2⟩ means 5 bits in total, 2 of them before the binary point:

00.000 00.001 00.010 00.011 00.100 00.101 00.110

Advantages

More familiar to physicists who use floating / fixed values

Subtleties

Need to use "quantization-aware training" to reduce the number of bits
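For comparison with the integer encoding, a fixed-point type with W total bits and I integer bits (sign included) can be emulated as below. This is a rough sketch with hypothetical helper names: it truncates downward and saturates at the range ends, which is a simplification, since the behaviour of the real HLS type depends on its quantization and overflow modes.

```python
import math


def fixed_values(width, int_bits):
    """All values representable by a signed fixed-point type with the given widths."""
    step = 2.0 ** -(width - int_bits)
    return [k * step for k in range(-(2 ** (width - 1)), 2 ** (width - 1))]


def quantize(x, width, int_bits):
    """Truncate x down to the grid and clamp to the representable range (assumption)."""
    step = 2.0 ** -(width - int_bits)
    k = math.floor(x / step)
    k = max(-(2 ** (width - 1)), min(2 ** (width - 1) - 1, k))
    return k * step


VALUES = fixed_values(5, 2)
print(len(VALUES), VALUES[0], VALUES[-1])  # prints "32 -2.0 1.875"
print(quantize(0.3, 5, 2))                 # prints "0.25"
```

With 3 fractional bits the grid spacing is 0.125, so the precision is uniform over the whole range, in contrast to the integer encoding where the spacing is set per variable by $c_{\text{min}}$ and $c_{\text{max}}$.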



## fwX BDT vs. hls4ml BDT

#### Setup

- Details in paper; setups kept as similar as possible
- Public datasets of e vs. γ
- ROC is the same because both use the same BDT

#### Comparison

- Lower latency
- Lower LUT, FF
- Higher BRAM (but 0.1% of avail.)

[Figure: panels comparing latency (ticks), LUT, FF, and BRAM]

## fwX BDT vs. <u>hls4ml neural network</u>

#### Setup

- Details <u>not</u> in paper
- Using 200 MHz clock here
- Chose NN architecture to match ROC performance of BDT

#### Comparison

- Lower latency
- Comparable LUT, FF
- Higher BRAM (but 0.3% of avail.)







# Thank you to my collaborators





#### Paper authors



A. Rodic (did the comparison with hls4ml's neural network)

# Back up

# Benchmark firmware performance



| Parameter                             | Value                       | Comments                     |
|---------------------------------------|-----------------------------|------------------------------|
| **FPGA setup**                        |                             |                              |
| Chip family                           | Xilinx Virtex Ultrascale+   |                              |
| Chip model                            | xcvu9p-flga2104-2L-e        |                              |
| Vivado version                        | 2019.2.1                    |                              |
| Synthesis type                        | C-Synthesis                 |                              |
| HLS or RTL                            | HLS                         |                              |
| HLS interface pragma                  | None                        |                              |
| Clock speed                           | 320 MHz                     | Clock period is 3.125 ns     |
| **ML training configuration**         |                             |                              |
| ML training method                    | Boosted decision tree       | Binary classification        |
| Boost method                          | Adaptive                    | AdaBoost with yes/no leaf    |
| No. of event types to classify        | 2                           | Signal vs. background        |
| No. of input variables                | 4                           |                              |
| No. of trees used for training        | 100                         |                              |
| Maximum tree depth                    | 4                           |                              |
| **Nanosecond Optimization configuration** |                         |                              |
| Bin Engine type                       | Bit Shift Bin Engine (BSBE) |                              |
| No. of bits for input variables       | 8 bits for each             |                              |
| No. of bits for cut thresholds        | 8 bits for each             |                              |
| No. of bits for BDT output score      | 8 bits                      |                              |
| No. of trees after merging            | 10                          | Tree Merger via ordered list |
| No. of final trees                    | 10, none removed            | Tree Remover by truncation   |
| No. of bins                           | 26132                       | Cut Eraser not used          |
| **FPGA cost**                         |                             |                              |
| Latency                               | 3 clock ticks               | 9.375 ns                     |
| Interval                              | 1 clock tick                | 3.125 ns                     |
| Look up tables                        | 1903 out of 1182240         | < 0.2% of available          |
| Flip flops                            | 138 out of 2364480          | < 0.01% of available         |
| Block RAM                             | 8 out of 4320               | < 0.2% of available          |
| Ultra RAM                             | 0 out of 960                | -                            |
| Digital signal processors             | 0 out of 6840               | -                            |



The latency of about 10 ns is independent of the clock speed from 100 to 320 MHz
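The nanosecond figures quoted in these slides follow from the clock period, which is the inverse of the clock frequency. A small helper (the name `ticks_to_ns` is ours) reproduces them:

```python
def ticks_to_ns(ticks, clock_mhz):
    """Convert a latency in clock ticks to nanoseconds for a given clock in MHz."""
    return ticks * 1000.0 / clock_mhz


print(ticks_to_ns(1, 320))   # clock period at 320 MHz: 3.125
print(ticks_to_ns(3, 320))   # benchmark latency: 9.375
print(ticks_to_ns(15, 320))  # 15 ticks at the same clock: 46.875
```

The same conversion gives 15.625 ns for 5 ticks at 320 MHz, which rounds to the 16 ns quoted for the VBF Higgs result if that result uses the same clock.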

# **ATLAS-inspired cuts**



Table 9: List of input variables for the classification of the VBF Higgs boson vs. multijet process. Also given are the ATLAS-inspired cut-based offline thresholds for Run 2 [64] and HL-LHC [65]. For Run 2, a difference with respect to the document arises for the $m_{jj}$ threshold, which is quoted as 1100 GeV for L1 MJJ-500-NFF; we use the > 99% offline efficiency point, which is reached around $m_{jj}$ > 1300 GeV. For the others, the offline thresholds are used. For HL-LHC, the single-level scheme values are quoted. The performance of the cut-based approach using these values is compared to the BDT result in figure 16. The non-optimized (non-opt) configuration includes the five variables from the optimized configuration.

| Input variable                    | Description                                      | ATLAS Run-2 offline cut [64], see caption | ATLAS HL-LHC offline cut [65], see caption | Used in BDT |
|-----------------------------------|--------------------------------------------------|-------------------------------------------|--------------------------------------------|-------------|
| $p_{\mathrm{T1}}$                 | Leading jet $p_{\mathrm{T}}$                     | > 90 GeV                                  | > 75 GeV                                   | -           |
| $p_{\mathrm{T2}}$                 | Subleading jet $p_{\mathrm{T}}$                  | > 80 GeV                                  | > 75 GeV                                   | Optimized   |
| $p_{\mathrm{T}12}$                | Sum $p_{\mathrm{T1}} + p_{\mathrm{T2}}$          | -                                         | -                                          | Optimized   |
| $\lvert \eta_1 \rvert$            | Leading jet $\eta$                               | < 3.2                                     | -                                          | -           |
| $\lvert \eta_2 \rvert$            | Subleading jet $\eta$                            | < 4.9                                     | -                                          | -           |
| $\prod_{\eta}$                    | Product $\eta_1 \cdot \eta_2$                    | -                                         | -                                          | Optimized   |
| $\lvert \Delta\eta \rvert$        | Separation in $\lvert \eta_2 - \eta_1 \rvert$    | > 4.0                                     | > 2.5                                      | -           |
| $\lvert \Delta\phi \rvert$        | Separation in $\lvert \phi_2 - \phi_1 \rvert$    | < 2.0                                     | < 2.5                                      | non-opt     |
| $\lvert \Delta R \rvert$          | $\sqrt{(\Delta\eta)^2 + (\Delta\phi)^2}$         | -                                         | -                                          | non-opt     |
| $m_{jj}$                          | Dijet invariant mass                             | > 1300 GeV                                | -                                          | Optimized   |
| $p_{\mathrm{T}}^{jj}$             | Dijet $p_{\mathrm{T}}$                           | -                                         | -                                          | Optimized   |

## **Data flow**





Each variable is processed independently of the others


# Firmware design: Bin Engines







 Look up thresholds in memory, compare

- Bit shift to localize data
  - This is fast
- Use combinational logic as much as possible, without multiplication. No explicit clocked operations.
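The following sketch shows one way a bit shift can localize a threshold lookup. It is an illustrative guess written in Python, not the BSBE firmware: the shift width, the thresholds, and the function names are assumptions. The upper bits of the 8-bit input select a coarse cell, a pre-computed table gives the bin index at the start of that cell, and only the few thresholds inside the cell remain to be compared.

```python
from bisect import bisect_right

N_BITS = 8   # input precision, as in the benchmark table
SHIFT = 4    # assumption: coarse cells of 2**4 = 16 consecutive values


def build_coarse_table(thresholds, shift, n_bits):
    """For each coarse cell, store the number of thresholds <= the cell's first value."""
    n_cells = 1 << (n_bits - shift)
    return [bisect_right(thresholds, cell << shift) for cell in range(n_cells)]


def find_bin(x, thresholds, coarse, shift):
    """Bin index of x = number of thresholds <= x, using the shifted lookup first."""
    b = coarse[x >> shift]   # bit shift selects the coarse cell, no multiplication
    # Only thresholds inside the same cell are left to compare. This loop is
    # sequential here; comparisons in hardware could be made in parallel.
    while b < len(thresholds) and thresholds[b] <= x:
        b += 1
    return b


# Made-up sorted integer thresholds for one input variable.
THRESHOLDS = [10, 37, 40, 128, 200]
COARSE = build_coarse_table(THRESHOLDS, SHIFT, N_BITS)
```

The coarse table is computed once in software, so the per-event work in this sketch is one shift, one table read, and a handful of comparisons.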

## Test bench





No difference seen with respect to the software implementation













#### install









# vs. hls4ml's BDT

| Parameter               | FWXMACHINA                                                               | hls4ml/Conifer            |
|-------------------------|--------------------------------------------------------------------------|---------------------------|
| **ML training setup**   |                                                                          |                           |
| Training software       | TMVA                                                                     | TMVA                      |
| Physics problem         | electron vs. photon                                                      | electron vs. photon       |
| Training samples        | from ref. [56]                                                           | from ref. [56]            |
| No. of event classes    | 2                                                                        | 2                         |
| No. of training trees   | 100                                                                      | 100                       |
| Max. depth              | 4                                                                        | 4                         |
| No. of input variables  | 4                                                                        | 4                         |
| Other TMVA parameters   | TMVA defaults                                                            | TMVA defaults             |
| Nanosec. Optimization   | Flattened & merged to 10 final trees, without Tree Remover or Cut Eraser | N/A                       |
| **FPGA and firmware setup** |                                                                      |                           |
| Chip family             | Xilinx Virtex Ultrascale+                                                | Xilinx Virtex Ultrascale+ |
| Chip model              | xcvu9p-flga2104-2L-e                                                     | xcvu9p-flga2104-2L-e      |
| Vivado HLS version      | 2019.2                                                                   | 2019.2                    |
| Clock speed, period     | 320 MHz, 3.125 ns                                                        | 320 MHz, 3.125 ns         |
| Precision               | ap\_int⟨8⟩                                                               | ap\_ufixed⟨10, 5⟩         |
| Bin Engine              | BSBE                                                                     | N/A                       |
| **FPGA cost** (actual timing values and resource usage by RTL synthesis and implementation) |      |                           |
| Latency                 | 3 clock ticks, 9.375 ns                                                  | 15 clock ticks, 46.875 ns |
| Interval                | 1 clock tick, 3.125 ns                                                   | 1 clock tick, 3.125 ns    |
| LUT                     | 717, 0.06% of total                                                      | 3834, 0.3% of total       |
| FF                      | 147, < 0.01% of total                                                    | 1966, < 0.1% of total     |
| BRAM 18k                | 5.5, 0.1% of total                                                       | 0                         |
| URAM                    | 0                                                                        | 0                         |
| DSP                     | 2, 0.03% of total                                                        | 0                         |

