# NEW IDEAS & TECHNOLOGIES IN TRIGGER FOR HL-LHC (AND RUN 3)

#### JAVIER DUARTE (UCSD) OCTOBER 23, 2019 WEST COAST LHC JAMBOREE, SLAC





- Modern FPGAs with large numbers of embedded components that perform multiplication (DSPs), apply logical functions (LUTs), or store data (BRAM)



- High level synthesis to more easily program FPGAs
- Sophisticated algorithms
- Machine learning
- GPUs or FPGAs or ASICs as **co-processors** for software trigger



#### **Challenges:**

- Each collision produces O(10<sup>3</sup>) particles
- The detectors have O(10<sup>8</sup>) sensors
- Extreme data rates of O(100 TB/s)










# **LEVEL-1 TRACK TRIGGER**

- Algorithm approach: tracklet and Kalman filter hybrid algorithm written in Vivado HLS to expedite development
  - Tracks are seeded with pairs of stubs in adjacent layers
  - Projections to other layers are calculated (assuming beamline constraint)
  - Full tracks after duplicate removal are inputs to the final track fit (Kalman filter)
- R&D efforts: displaced tracking for long-lived particles, etc.
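The seeding and projection steps can be illustrated with a minimal sketch in the transverse plane only. It assumes a circular track that passes through the beamline at the origin (the beamline constraint). The function names and stub radii are hypothetical and are not taken from the firmware.

```python
import math

def seed_tracklet(r1, phi1, r2, phi2):
    """Return (rinv, phi0) of the circle through the origin and two stubs.

    rinv is the signed inverse radius of curvature and phi0 is the track
    direction at the beamline. Assumes r2 > r1 > 0 (adjacent layers).
    """
    dphi = phi1 - phi2
    # Straight-line distance (chord) between the two stubs.
    chord = math.sqrt(r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * math.cos(dphi))
    # Inscribed-angle theorem with the origin on the circle:
    # chord = 2 * R * sin(dphi), so 1/R = 2 * sin(dphi) / chord.
    rinv = 2.0 * math.sin(dphi) / chord
    phi0 = phi1 + math.asin(0.5 * r1 * rinv)
    return rinv, phi0

def project_phi(rinv, phi0, r):
    """Azimuth at which the seeded circle crosses a layer of radius r."""
    return phi0 - math.asin(0.5 * r * rinv)

# Hypothetical stubs generated from a known track (illustrative units).
true_rinv, true_phi0 = 0.002, 0.3
r_in, r_out, r_proj = 25.0, 37.0, 52.0
phi_in = true_phi0 - math.asin(0.5 * r_in * true_rinv)
phi_out = true_phi0 - math.asin(0.5 * r_out * true_rinv)

rinv, phi0 = seed_tracklet(r_in, phi_in, r_out, phi_out)
phi_proj = project_phi(rinv, phi0, r_proj)
print(rinv, phi0, phi_proj)
```

In this toy picture, a stub found near `phi_proj` in the third layer would be matched to the tracklet and passed on to the final fit.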



| Processing module  |
|--------------------|
| InputRouter        |
| VMRouter           |
| TrackletEngine     |
| TrackletCalculator |
| ProjectionRouter   |
| MatchEngine        |
| MatchCalculator    |
| DuplicateRemoval   |
| KalmanFilter       |

- Correlator layer 1 will process pileup-mitigated candidates {μ,e,γ,h<sup>±</sup>,h<sup>0</sup>,vtx}
- Full correlator trigger must complete all processing & transmit trigger objects {μ,e,γ,τ,j,MET,etc.} to the GT within 2.5 μs



## MACHINE LEARNING IN FPGAS WITH HLS4ML JINST 13 (2018) P07027

hls4ml for physicists or ML experts to translate ML algorithms into FPGA firmware








hls4ml convert -c keras-config.yml







- **Precision**: inputs, weights, biases
- Strategy:
  - Resource for large NN
  - Latency for small NN (fully pipelined)






#### **NETWORK TUNING: COMPRESSION & RESOURCES**



[Figure: FPGA device view with regions labeled X0Y0 to X5Y4]

Fixed-point precision: `ap_fixed<width,integer>`, where `width` is the total number of bits, `integer` is the number of integer bits, and the remaining `width - integer` bits are fractional.

- Big reduction in DSPs (multipliers) with compression
- Easily fits on 1 FPGA after compression
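To make the `ap_fixed<width,integer>` notation concrete, here is a minimal Python sketch that emulates how a real number is stored in that format. It assumes the default Vivado HLS behavior of truncation toward negative infinity and wrap-around on overflow. The helper name `to_ap_fixed` is hypothetical and is not part of hls4ml.

```python
import math

def to_ap_fixed(x, width, integer):
    """Emulate storing x as a signed ap_fixed<width,integer>.

    Assumes truncation (round toward minus infinity) and wrap-around on
    overflow, and width >= integer >= 1.
    """
    frac_bits = width - integer
    # Scale to an integer number of least-significant bits and truncate.
    scaled = math.floor(x * (1 << frac_bits))
    # Wrap into the range of a width-bit two's-complement integer.
    modulus = 1 << width
    scaled = (scaled + modulus // 2) % modulus - modulus // 2
    return scaled / (1 << frac_bits)

# ap_fixed<8,2>: 2 integer bits (including sign) and 6 fractional bits,
# so the representable range is [-2, 2) in steps of 1/64.
print(to_ap_fixed(0.7, 8, 2))
print(to_ap_fixed(-0.7, 8, 2))
print(to_ap_fixed(2.5, 8, 2))  # outside the range, so it wraps around
```

Fewer fractional bits coarsen the step size and too few integer bits cause overflow, which is why the precision of inputs, weights, and biases is tuned per network.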

#### **NETWORK TUNING: PARALLELIZATION & TIMING**

Increasing the reuse factor increases latency.

For low latency with a small reuse factor, inference takes O(100 ns)! What if we have O(ms)? Then we can go to **bigger networks!**
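The trade-off can be sketched with a toy cost model in which the reuse factor is the number of times each multiplier is reused within a layer. This is an illustration under stated assumptions, not the hls4ml scheduler: the layer sizes, the 5 ns clock period, and the fixed per-layer overhead are all assumed values.

```python
import math

def dense_layer_cost(n_in, n_out, reuse_factor, overhead_cycles=2):
    """Toy cost of one fully connected layer.

    Returns (parallel multipliers, clock cycles). overhead_cycles is an
    assumed fixed cost for accumulation and activation.
    """
    n_mult = n_in * n_out
    multipliers = math.ceil(n_mult / reuse_factor)
    cycles = reuse_factor + overhead_cycles
    return multipliers, cycles

def network_cost(layers, reuse_factor, clock_ns=5.0):
    """Sum multipliers and latency over layers executed one after another."""
    total_mult = 0
    total_cycles = 0
    for n_in, n_out in layers:
        mult, cycles = dense_layer_cost(n_in, n_out, reuse_factor)
        total_mult += mult
        total_cycles += cycles
    return total_mult, total_cycles * clock_ns

# Assumed example network: (inputs, outputs) of each dense layer.
layers = [(16, 64), (64, 32), (32, 32), (32, 5)]
for rf in (1, 2, 4):
    mult, latency_ns = network_cost(layers, rf)
    print(f"reuse factor {rf}: {mult} multipliers, {latency_ns} ns")
```

In this model a larger reuse factor trades multipliers for clock cycles, which is the qualitative behavior described above.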

- Inference of ML algorithms possible in O(100 ns) on 1 FPGA with hls4ml!
  - Applications across CMS, ATLAS, DUNE, and accelerator controls:
    - Muon p<sub>T</sub> determination in the CMS endcap with a DNN: runs in 160 ns on an FPGA and reduces the fake muon rate by up to 80%
    - Variational autoencoder for anomaly detection



- Currently supported:
  - Small and large dense NNs
  - Binary and ternary NNs
  - Small 1D/2D CNNs
- Planned support:
  - Big 1D/2D CNNs
  - Graph NNs
  - Other HLS/RTL backends

## **CO-PROCESSORS**

Specialized co-processor hardware for machine learning inference:

- Intel<sup>®</sup> FPGA Acceleration Hub
- Microsoft Catapult/Brainwave (FPGA)
- FPGA partner solutions on AWS for Amazon EC2 FPGA instances, deployed via AWS Marketplace as an Amazon Machine Image (AMI) plus an Amazon FPGA Image (AFI). The AFI is secured, encrypted, and dynamically loaded into the FPGA; it can't be copied or downloaded
- Google Tensor Processing Unit (ASIC)







- Services for Optimized Network Inference on Coprocessors (SONIC)
  - Send jet images from CMSSW to Microsoft Brainwave FPGA
- Two modes: cloud service and on premises

#### **SONIC LATENCY**



- Remote: FNAL (IL) to Azure (VA), ⟨time⟩ = 60 ms
  - Highly dependent on network conditions
- On-prem: run CMSSW on Azure, ⟨time⟩ = 10 ms
  - On FPGA: 1.8 ms for inference
  - Remaining time used for classifying and I/O

# SONIC+BRAINWAVE IN LHC COMPUTING





- Brainwave + SONIC achieves
  - 175× (30×) lower latency on-prem (remote) vs. CMS CPU
  - 1 FPGA service can serve 100s of CPU worker nodes
    - **Competitive throughput** vs. GPU as a service
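As a quick consistency check, the quoted speedups can be combined with the mean SONIC latencies (10 ms on-prem, 60 ms remote, of which 1.8 ms is spent on the FPGA) to estimate the CPU latency they imply. The variable names are illustrative, and the CPU latency itself is inferred here rather than quoted in the slides.

```python
# Numbers quoted in the slides.
onprem_ms = 10.0        # mean latency, CMSSW running on Azure
remote_ms = 60.0        # mean latency, FNAL to Azure
fpga_ms = 1.8           # inference time on the FPGA itself
onprem_speedup = 175.0  # latency improvement vs. CMS CPU, on-prem
remote_speedup = 30.0   # latency improvement vs. CMS CPU, remote

# Implied CPU latency from each mode (an inference, not a quoted value).
cpu_from_onprem_ms = onprem_ms * onprem_speedup
cpu_from_remote_ms = remote_ms * remote_speedup

# Time not spent on the FPGA in the on-prem case (classifying and I/O).
onprem_overhead_ms = onprem_ms - fpga_ms

print(cpu_from_onprem_ms, cpu_from_remote_ms, onprem_overhead_ms)
```

Both modes point to a CPU latency of roughly 1.8 s per inference, so the two speedup figures are mutually consistent, and most of the 10 ms on-prem latency is spent outside the FPGA.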



- HCAL reconstruction and tracking contribute significantly to HLT compute time
- GPU/FPGA as co-processor can reduce compute time
  - Patatrack pixel reconstruction on GPUs
  - HCAL reconstruction with ML on GPUs/FPGAs (as a service)

[Figure: network inputs TS0 to TS7, iη, iφ, depth]





# LHCB HIGH-LEVEL TRIGGER ON GPUS

- By 2021, full LHCb trigger chain in software (HLT)
- Run full first stage of HLT (HLT1) on GPUs
- One GPU has to process 30/60 k events/s
- The current sequence of full Velo, primary vertices, full UT, and SciFi decoding runs on an NVIDIA V100 at 112 kHz
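The measured throughput can be compared with the per-GPU requirement using simple arithmetic. The sketch below assumes that "30/60 k events/s" denotes two alternative per-GPU targets; that reading is an interpretation, not something stated in the slides.

```python
measured_khz = 112.0        # quoted throughput on an NVIDIA V100
targets_khz = (30.0, 60.0)  # assumed: two alternative per-GPU targets

# Ratio of measured throughput to each target (values above 1 mean headroom).
headroom = {target: measured_khz / target for target in targets_khz}

for target, ratio in headroom.items():
    print(f"target {target:.0f} kHz: headroom factor {ratio:.2f}")
```

Under this reading, the quoted sequence already exceeds both targets, leaving room for further reconstruction and selection steps on the same GPU.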





#### SUMMARY AND OUTLOOK

- Particle physics experiments face **extreme trigger challenges** in the coming years
- Exploiting new algorithms, new hardware, and machine learning will be key to the success of next-gen experiments
- Open questions:
  - With more sophisticated algorithms at earlier trigger, how do we ensure performance/safety? Backup triggers?
  - What community tools do we need to deploy ML at the trigger?
  - Which co-processors are best suited to which tasks for the high-level trigger?
  - How do we incorporate timing information at the trigger level?
  - What are the physics use-cases for L1 scouting at 40 MHz?
  - What can we do with the new trigger hardware capabilities which we aren't thinking about?
  - L1 gives us a fundamental limitation but is there more we can exploit at the HLT?
  - Can we make a (realistic) wish-list for triggerable characteristics of events?




# BACKUP

#### MUON RECONSTRUCTION WITH KALMAN FILTER



- Phase-2: improved reconstruction using a Kalman filter
  - Iterative outer-inner tracking to reconstruct tracks and assign track p<sub>T</sub> (as offline)
- Both PV constrained and unconstrained tracks: displaced standalone muons
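The core of a Kalman filter is the measurement update, sketched below for a single scalar track parameter. This is a deliberately simplified illustration: it assumes a static one-dimensional state with no propagation between stations, and the measurement values and variances are hypothetical.

```python
def kalman_update(x, p, z, r):
    """One scalar Kalman measurement update.

    x, p: current state estimate and its variance
    z, r: new measurement and its variance
    """
    k = p / (p + r)            # Kalman gain
    x_new = x + k * (z - x)    # pull the estimate toward the measurement
    p_new = (1.0 - k) * p      # variance shrinks after each update
    return x_new, p_new

def fit(measurements, x0, p0):
    """Apply updates in the given order, e.g. outermost station first."""
    x, p = x0, p0
    for z, r in measurements:
        x, p = kalman_update(x, p, z, r)
    return x, p

# Hypothetical (value, variance) measurements, ordered outer to inner.
hits = [(2.0, 1.0), (1.0, 0.5)]
x_fit, p_fit = fit(hits, x0=0.0, p0=1.0)
print(x_fit, p_fit)
```

A realistic muon track fit would use a multi-dimensional state and propagate it through the magnetic field and material between stations before each update.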





#### **UPGRADING THE LEVEL-1 TRIGGER (BEFORE)**



## **UPGRADING THE LEVEL-1 TRIGGER (AFTER)**

More and better information available in the Level-1 trigger!

What can we do with it?


