

#### David W. Miller

Enrico Fermi Institute



January 15, 2019





D.W. Miller (EFI, Chicago)

#### **Outline**

1 Challenges of the Energy and Luminosity Frontier

2 ATLAS Phase I & II Hadronic Trigger Systems

3 Machine learning using FPGAs and MPSoCs

4 Summary and conclusions




#### The overwhelming hadronic environment of the LHC

HL-LHC: $\mathcal{L}_{\rm inst} = 10^{35}\ \text{cm}^{-2}\,\text{s}^{-1} = 0.1\ \text{pb}^{-1}\,\text{s}^{-1} \Rightarrow 30$ kHz of dijet events
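The unit conversion and the quoted rate can be checked with a few lines of Python. This is only a sketch: the slide does not state a dijet cross section, so the value computed below is simply what the quoted 30 kHz implies, and the function names are illustrative.

```python
# Unit conversion: 1 barn = 1e-24 cm^2, so 1 pb = 1e-36 cm^2.
CM2_PER_PB = 1e-36


def lumi_cm2_to_inv_pb(lumi_cm2_s: float) -> float:
    """Convert an instantaneous luminosity from cm^-2 s^-1 to pb^-1 s^-1."""
    return lumi_cm2_s * CM2_PER_PB


def event_rate_hz(lumi_inv_pb_s: float, cross_section_pb: float) -> float:
    """Event rate = instantaneous luminosity x cross section."""
    return lumi_inv_pb_s * cross_section_pb


lumi = lumi_cm2_to_inv_pb(1e35)

# ASSUMPTION: cross section derived from the quoted 30 kHz, not given on the slide.
implied_sigma_pb = 30e3 / lumi

print(f"L = {lumi:.1f} pb^-1 s^-1")
print(f"implied dijet cross section = {implied_sigma_pb / 1e3:.0f} nb")
print(f"rate = {event_rate_hz(lumi, implied_sigma_pb) / 1e3:.0f} kHz")
```

In other words, the quoted rate corresponds to an effective dijet cross section of a few hundred nanobarns above whatever threshold the slide has in mind.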




## Hadronic final states: major part of LHC physics program



- Physics may be compromised due to trigger & data-processing limitations
- Even if we *can* trigger, offline data management may be a bottleneck




#### Go GLOBAL: global feature extraction trigger (gFEX) for ATLAS Run 3



#### Goal

Analyze event-level features for characteristics of moderate-$p_{\rm T}$ ($\sim$100s of GeV) signatures of new and key physics processes

#### Strategy

Input the entire calorimeter **onto a** single trigger board

#### Tactics

- Coarse towers ($0.2 \times 0.2$ in $\Delta\eta \times \Delta\phi$)
- State-of-the-art FPGAs
- MPSoC for control and additional processing
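As a rough illustration of the coarse-tower idea, the sketch below sums a finer grid of calorimeter cell energies into coarser towers by block-summing. The 2×2 grouping factor, the toy grid, and the function name are assumptions made for illustration; this is not the gFEX firmware mapping.

```python
from typing import List

Grid = List[List[float]]


def build_coarse_towers(cells: Grid, factor: int = 2) -> Grid:
    """Sum `factor` x `factor` blocks of fine cells into coarse towers."""
    n_rows, n_cols = len(cells), len(cells[0])
    if n_rows % factor or n_cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    towers = [[0.0] * (n_cols // factor) for _ in range(n_rows // factor)]
    for i, row in enumerate(cells):
        for j, energy in enumerate(row):
            # Each fine cell contributes to exactly one coarse tower.
            towers[i // factor][j // factor] += energy
    return towers


# Toy 4x4 grid of cell energies in GeV (hypothetical values).
fine = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 5.0],
    [0.0, 0.0, 6.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
]
coarse = build_coarse_towers(fine)
print(coarse)
```

Coarsening trades spatial resolution for a data volume small enough to bring the whole calorimeter onto one board, while the total energy is preserved.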


## gFEX Performance for Run 3

- Signal: e.g. boosted tops
- Compare to Run 2 triggers



#### gFEX can efficiently identify jet structure at 300 GeV!


## The gFEX trigger design



- Implement new algorithms using state-of-the-art FPGAs + SoCs
- Image-like event format is well-suited for computer vision & ML


### gFEX Design: Virtex 7 & Zynq UltraScale+



#### 2.3 Tb/s of calorimeter data received by gFEX


#### gFEX already recorded Stable Beams data in Run 2!

- gFEX recorded Stable Beams data on Oct 16, 2018!

- Calorimeter back-end system is a prototype for the Phase I/II upgrade (Run 3 & 4)
- This is a major milestone, but there is certainly more to come





#### Run 3 ideas for Run 4 reality: Global Event Processor



Receives trigger object information from all systems (jets, electrons, muons, timing, and possibly tracks). Makes global trigger decision about the event.

Built from a *common module* with both a **Zynq and two processor FPGAs**.


#### Run 3 ideas for Run 4 reality: Hardware Track Triggers



**Regional tracking at 1 MHz and global tracking at 100 kHz**, accomplished with associative memory **ASICs** (AMTP) and tracking in **FPGAs** (SSTP)
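To give a feel for what an associative-memory stage does, here is a minimal software sketch of pattern-bank matching. It rests on stated assumptions: hits are reduced to coarse "superstrip" IDs per layer, a pattern fires if at most one layer is missing, and the bank, hit values, and function name are all hypothetical. A real associative-memory ASIC compares every stored pattern in parallel; the loop here only emulates the result.

```python
from typing import List, Set, Tuple

Pattern = Tuple[int, ...]  # one coarse "superstrip" ID per detector layer


def match_patterns(
    hits_by_layer: List[Set[int]],
    pattern_bank: List[Pattern],
    max_missing: int = 1,
) -> List[Pattern]:
    """Return the bank patterns ("roads") compatible with the recorded hits.

    A pattern fires when at most `max_missing` layers lack a matching hit.
    """
    roads = []
    for pattern in pattern_bank:
        missing = sum(
            1
            for layer, strip in enumerate(pattern)
            if strip not in hits_by_layer[layer]
        )
        if missing <= max_missing:
            roads.append(pattern)
    return roads


# Hypothetical 4-layer example.
bank = [(1, 1, 2, 2), (5, 5, 6, 6), (1, 2, 3, 4)]
hits = [{1, 5}, {1}, {2, 3}, {2, 9}]
roads = match_patterns(hits, bank)
print(roads)
```

Only the hits inside matched roads would then need to be passed on to a precise track fit, which is the step the slide assigns to FPGAs.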


Machine Learning for Future Triggers Systems




## Convolutional neural networks (CNN) for jet identification



Effectively the same CNN from Komiske, *et al.* can be used for top-tagging, using either high-level observables or jet images.



Komiske, Metodiev, Schwartz (arXiv:1612.01551)

Moore, Nordström, Varma, Fairbairn (arXiv:1807.04769)

(The CNNs here use 4 layers, with 64 filters in the convolutional layers and a 128-node dense layer.)
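To get a sense of scale for a network of this kind, the sketch below counts parameters and multiply-accumulate operations (MACs). Only the 64 filters and the 128-node dense layer come from the slide. Everything else is an assumption for illustration: three convolutional layers plus one dense layer as the reading of "4 layers", 3×3 kernels, "same" padding, 2×2 pooling after each convolution, and a single-channel 30×30 input. The final classification layer is left out.

```python
def conv2d_cost(height, width, in_ch, out_ch, kernel):
    """Parameters and MACs of one 'same'-padded, stride-1 convolutional layer."""
    params = (kernel * kernel * in_ch + 1) * out_ch  # weights + biases
    macs = height * width * kernel * kernel * in_ch * out_ch
    return params, macs


def dense_cost(n_in, n_out):
    """Parameters and MACs of one fully connected layer."""
    return (n_in + 1) * n_out, n_in * n_out


def small_cnn_cost(size=30, channels=1, filters=64, kernel=3,
                   n_conv=3, dense_nodes=128):
    """Total (parameters, MACs) for the assumed conv + dense stack."""
    total_params = total_macs = 0
    h = w = size
    in_ch = channels
    for _ in range(n_conv):
        p, m = conv2d_cost(h, w, in_ch, filters, kernel)
        total_params += p
        total_macs += m
        h, w = h // 2, w // 2  # ASSUMPTION: 2x2 max pooling after each conv
        in_ch = filters
    p, m = dense_cost(h * w * in_ch, dense_nodes)
    return total_params + p, total_macs + m


params, macs = small_cnn_cost()
print(f"parameters: {params:,}")
print(f"MACs per image: {macs:,}")
```

Under these assumed values the count comes to roughly $10^7$ MACs per image; different kernel sizes or pooling choices would shift that number noticeably.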

#### There is a significant benefit to modern MPSoC devices:

- Execute high-level applications on CPU/RPU
- Perform low/fixed latency operations on FPGA
- Offload simple vector/matrix operations to GPU

And we can execute complex ML applications using CNN directly on these devices!

[Figure: network compression flow, with the labels "Dense Neural Network (FP32)", "Prune", "Finetune", and "Pruning (fewer parameters)"]
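The prune and fine-tune labels refer to compressing a dense network before deployment. As a rough illustration, magnitude pruning zeroes the smallest-magnitude weights. The function name, the toy weights, and the 50% sparsity are hypothetical, and this is not claimed to be the algorithm used by any particular vendor tool.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by |w|, smallest first; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]  # toy FP32-style weights
pruned = magnitude_prune(weights, sparsity=0.5)
remaining = sum(1 for w in pruned if w != 0.0)
print(pruned, f"-> {remaining} of {len(weights)} weights kept")
```

In a full flow, fine-tuning would then retrain the surviving weights to recover accuracy, and the prune and fine-tune steps may be repeated.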

## Proof-of-principle with the gFEX Zynq: ResNet-50

Implement ResNet-50 neural network for image classification on our Zynq UltraScale+ MPSoC for gFEX

- $\rightarrow$ Dramatically larger network!
- $\rightarrow$ Thousands of filters
- $\rightarrow$ $\sim$10 billion operations!
- $\rightarrow$ Merely a proof of principle



Work conducted by Emily Smith (grad student), in collaboration with Giordon Stark (UC Santa Cruz) and two UChicago undergraduates, Jack Huang and Ben Warren.


## Proof-of-principle with the gFEX Zynq: ResNet-50



    Run ResNet50 CONV layers ...
      DPU CONV Execution time: 13607us
      DPU CONV Performance: 566.62GOPS
    Run ResNet50 FC layers ...
      DPU FC Execution time: 236us
      DPU FC Performance: 16.9492GOPS
    top[0] prob = 0.993050  name = English setter
    top[1] prob = 0.001493  name = clumber, clumber spaniel
    top[2] prob = 0.001493  name = Brittany spaniel
    top[3] prob = 0.001163  name = English springer, English springer spaniel
    top[4] prob = 0.000705  name = Great Pyrenees

    Load image : ILSVRC2012_test_00068213.JPEG
    Run ResNet50 CONV layers ...
      DPU CONV Execution time: 13595us
      DPU CONV Performance: 567.12GOPS
    Run ResNet50 FC layers ...
      DPU FC Execution time: 236us
      DPU FC Performance: 16.9492GOPS
    top[0] prob = 0.915599  name = rock beauty, Holocanthus tricolor
    top[1] prob = 0.075157  name = king penguin, Aptenodytes patagonica
    top[2] prob = 0.001768  name = anemone fish
    top[3] prob = 0.000835  name = fiddler crab
    top[4] prob = 0.000650  name = toucan

    Load image : ILSVRC2012_test_00042675.JPEG
    Run ResNet50 CONV layers ...
      DPU CONV Execution time: 13586us
      DPU CONV Performance: 567.496GOPS
    Run ResNet50 FC layers ...
      DPU FC Execution time: 235us
      DPU FC Performance: 17.0213GOPS
    top[0] prob = 0.977076  name = jaguar, panther, Panthera onca, Felis onca
    top[1] prob = 0.017896  name = leopard, Panthera pardus
    top[2] prob = 0.000891  name = tiger, Panthera tigris
    top[3] prob = 0.000540  name = cheetah, chetah, Acinonyx jubatus
    top[4] prob = 0.000540  name = tiger cat



Image processing takes $\mathcal{O}(\text{ms})$; this is expected to decrease to $\mathcal{O}(\mu\text{s})$ for the jet network and $30 \times 30$ "images" (i.e. gFEX events).
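The log above reports both an execution time and a throughput for each stage, so the implied operation count and image rate follow directly. The numbers below are copied from the first image in the log; the helper name is only illustrative, and the estimate covers DPU execution only, not image loading or pre-processing.

```python
def ops_from_log(time_us: float, gops: float) -> float:
    """Operations executed = execution time x throughput."""
    return time_us * 1e-6 * gops * 1e9


conv_time_us, conv_gops = 13607.0, 566.62  # DPU CONV stage
fc_time_us, fc_gops = 236.0, 16.9492       # DPU FC stage

conv_ops = ops_from_log(conv_time_us, conv_gops)
fc_ops = ops_from_log(fc_time_us, fc_gops)
latency_ms = (conv_time_us + fc_time_us) / 1e3
images_per_second = 1e3 / latency_ms

print(f"CONV ops: {conv_ops / 1e9:.2f} G, FC ops: {fc_ops / 1e6:.2f} M")
print(f"DPU latency: {latency_ms:.2f} ms -> {images_per_second:.0f} images/s")
```

The convolutional layers dominate: several billion operations in roughly 14 ms, which is the same order of magnitude as the $\sim$10 billion operations quoted for ResNet-50.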


#### Summary

- A major challenge for measurements and searches in hadronic final states at the future LHC will be triggering and data management
- Run 3 trigger systems (gFEX) have **unique and novel capabilities**, both as part of the baseline design and through **ML on MPSoCs & FPGAs**
  - Co-processor applications using FPGA+CPU+GPU would be very interesting!
- Clear **opportunities for the Phase II trigger system** as currently planned, in hadronic final-state physics, tracking, and more
  - There is much to be explored in Hardware-based Track Triggers for HLT!
- Strong involvement with scalable systems, hardware accelerators, and even data management plans for the "offline" world may be essential to realize further gains in physics potential


Appendix




## gFEX prototypes and production boards



- **Prototype** (1×**VU9P** + 1×**ZU19**) used for integration and commissioning at CERN since Q1 2018.
- Final board (3×VU9P + 1×ZU19) delivered to CERN on 25 June 2018, 5 years from proposal to delivery!
- Installation  $\sim$ now, ready for Run 3




## gFEX Multi-Processor System-on-Chip: Zynq Ultrascale+




#### gFEX Virtex 7 Ultrascale+ Processor FPGAs




## gFEX Multi-Processor System-on-Chip: Zynq Ultrascale+

#### Zynq® UltraScale+™ MPSoCs: EG Block Diagram




## Zynq Ultrascale+ Processors



#### 64-bit ARM quad-core processor




#### Processor and Zynq FPGA comparison for gFEX boards

|                 | Processor FPGA |       |       | Zynq  |       |
|-----------------|----------------|-------|-------|-------|-------|
| gFEX version    | v1             | v2/v3 | v3/v4 | v1/v2 | v3/v4 |
| FPGA type       | VX690T         | VU160 | VU9P  | Z7045 | ZU19  |
| Logic Cells (M) | 0.7            | 2.0   | 2.6   | 0.4   | 1.1   |
| CLB (M)         | 0.9            | 1.9   | 2.4   | 0.3   | 1.0   |
| Total RAM (Mb)  | 52.9           | 115.2 | 345.9 | 17.6  | 70.6  |
| DSP slices (K)  | 3.6            | 1.6   | 6.8   | 0.4   | 2.0   |

#### Global Event Processor Information



A Zynq MPSoC is also available in the future Phase II trigger system.


#### Industrial neural networks and ResNet-50

From Canziani, Culurciello, Paszke "An Analysis of Deep Neural Network Models for Practical Applications" (arXiv:1605.07678)



"operations count represent a good estimation of inference time."
