# Jets/MET with pileup and machine learning

### Nhan Tran Fermilab

Princeton HL-LHC Trigger Workshop January 16, 2018





## **INTRODUCTION AND OUTLINE**

### Last talk, Giovanni: **Particle Flow** - efficient combination of detector information to extract best physics performance

Building on the technology presented by Giovanni...

### This talk: more advanced algorithms

Dealing with pileup

PUPPI proof-of-concept: jets, MET, jet substructure (?), ...

More sophistication with machine learning and HLS4ML



## HL-LHC AND PILEUP

### Multiple pp collisions in the same beam crossing

To increase data rate, squeeze beams as much as possible



CMS Experiment at the LHC, CERN Data recorded: 2016-Sep-08 08:30:28.497920 GMT Run / Event / LS: 280327 / 55711771 / 67






### Need sophisticated techniques to preserve the physics!






## PUPPI

### PUPPI (PileUp Per Particle Id): based on PF paradigm

a general framework that determines, per particle, a weight for how likely a particle is from PU

key insight: using QCD ansatz to infer neutral pileup contribution

[1] define a local discriminant, $\alpha$, between pileup (PU) and leading vertex (LV)

[2] get a data-driven $\alpha$ distribution for PU using charged PU tracks

[3] for the neutrals, ask "how un-PU-like is $\alpha$ for this particle?", compute a weight

[4] reweight the four-vector of the particle by this weight, then proceed to interpret the event as usual

$$\alpha_i^C = \log \left[ \sum_{j \in \text{Ch,LV}} \frac{p_{T,j}}{\Delta R_{ij}} \Theta(R) \right]$$
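The four steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the CMS implementation: the particle dictionaries, the cone size `cone=0.4`, the `min_dr` cutoff, the chi-square form of the weight, the RMS taken around the median, and the 1/0 weights for charged LV/PU particles are all choices of this sketch that the slide does not specify.

```python
import math
from statistics import median


def delta_r(a, b):
    """Angular distance between two particles in the (eta, phi) plane."""
    dphi = math.remainder(a["phi"] - b["phi"], 2.0 * math.pi)
    return math.hypot(a["eta"] - b["eta"], dphi)


def alpha(i, particles, cone=0.4, min_dr=0.02):
    """Step 1: log of sum(pT_j / dR_ij) over charged LV particles j in the cone."""
    total = 0.0
    for j, p in enumerate(particles):
        if j == i or not (p["charged"] and p["from_lv"]):
            continue
        dr = delta_r(particles[i], p)
        if min_dr < dr < cone:
            total += p["pt"] / dr
    return math.log(total) if total > 0.0 else -math.inf


def pu_shape(particles, alphas):
    """Step 2: median and RMS of alpha for charged pileup tracks."""
    pu = [a for p, a in zip(particles, alphas)
          if p["charged"] and not p["from_lv"] and math.isfinite(a)]
    med = median(pu)  # raises StatisticsError if there are no usable PU tracks
    rms = math.sqrt(sum((a - med) ** 2 for a in pu) / len(pu))
    return med, rms


def weight(a, med, rms):
    """Step 3: chi-square (1 dof) CDF of the upward deviation from the PU median."""
    if not math.isfinite(a) or a <= med or rms <= 0.0:
        return 0.0
    chi2 = ((a - med) / rms) ** 2
    return math.erf(math.sqrt(chi2 / 2.0))


def puppi(particles):
    """Step 4: rescale each particle's pT by its weight (assumed massless)."""
    alphas = [alpha(i, particles) for i in range(len(particles))]
    med, rms = pu_shape(particles, alphas)
    out = []
    for p, a in zip(particles, alphas):
        if p["charged"]:
            w = 1.0 if p["from_lv"] else 0.0  # assumption: vertexing decides
        else:
            w = weight(a, med, rms)
        out.append({**p, "pt": p["pt"] * w, "weight": w})
    return out
```

In this sketch a neutral particle sitting next to hard charged LV tracks gets a weight near 1, while a neutral with no charged LV activity in its cone gets weight 0.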












## **BEYOND JETS AND MET**

### **Trying to preserve soft, hidden physics**

Things hidden in jets and jet substructure

Isolated, soft leptons in high PU environments

### \* Example plots from offline studies

Large gains in soft muon backgrounds and jet substructure










### **IMPLEMENTATION**

### Implementation of PUPPI proof-of-concept using high-level synthesis (HLS) as well

**[1]** define a local discriminant, $\alpha$, between pileup (PU) and leading vertex (LV)

COMPUTE FOR EACH NEUTRAL

$$\alpha_i^C = \log \left[ \sum_{j \in \text{Ch,LV}} \frac{p_{T,j}}{\Delta R_{ij}} \Theta(R) \right]$$

**[2]** get a data-driven $\alpha$ distribution for PU using charged PU tracks

PRECOMPUTE STEP 2 OFFLINE WITH CONSTANTS (FOR GIVEN PILEUP LEVEL)

**[3]** for the neutrals, ask "how un-PU-like is $\alpha$ for this particle?", compute a weight

**[4]** reweight the four-vector of the particle by this weight, then proceed to interpret the event as usual

DO STEP 3/4 WITH A LOOK-UP TABLE

RESOURCE USAGE ONLY FEW % OF FPGA AND 100S OF NS LATENCY WITH LITTLE DEGRADATION IN PERFORMANCE
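The look-up-table idea for steps 3/4 can be illustrated in Python. This is only a sketch: the constants `PU_MEDIAN` and `PU_RMS`, the table range and size, the 8-bit weight encoding, and the chi-square weight shape are hypothetical stand-ins for whatever the firmware actually uses.

```python
import math

# Hypothetical constants: these stand in for the values precomputed
# offline for a given pileup level (step 2).
PU_MEDIAN = 6.0
PU_RMS = 1.5

# Hypothetical table layout: 256 bins of alpha in [0, 16), 8-bit weights.
ALPHA_MIN, ALPHA_MAX, N_BINS, W_MAX = 0.0, 16.0, 256, 255
BIN_WIDTH = (ALPHA_MAX - ALPHA_MIN) / N_BINS


def exact_weight(alpha):
    """Reference weight: chi-square (1 dof) CDF of the upward deviation."""
    if alpha <= PU_MEDIAN:
        return 0.0
    chi = (alpha - PU_MEDIAN) / PU_RMS
    return math.erf(chi / math.sqrt(2.0))


# Filled once at bin centres; in hardware this would be a small ROM.
LUT = [round(exact_weight(ALPHA_MIN + (k + 0.5) * BIN_WIDTH) * W_MAX)
       for k in range(N_BINS)]


def lut_weight(alpha):
    """Steps 3/4 at run time: clamp alpha to the table range, then one lookup."""
    a = min(max(alpha, ALPHA_MIN), ALPHA_MAX)
    k = min(N_BINS - 1, int((a - ALPHA_MIN) / BIN_WIDTH))
    return LUT[k] / W_MAX


def reweight_pt(pt, alpha):
    """Scale a neutral particle's pT by the tabulated weight."""
    return pt * lut_weight(alpha)
```

Replacing the error function with a table lookup trades a small quantization error for a fixed, very short evaluation time, which is the point of the precompute step.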

### PERFORMANCE

### First physics results on HT and MET triggers for CMS phase-2 trigger interim document



Gains in rate reduction, signal efficiency, lower thresholds

## PLANS AND OUTLOOK

### Proof-of-concept PF+PUPPI running at L1

Bringing advanced physics algorithms to the hardware trigger!

Large physics gains: HT, MET, jet (substructure), lepton isolation

Other advanced algorithms... how about machine learning?

## HLS4ML: high level synthesis for machine learning

JENNIFER NGADIUBA, MAURIZIO PIERINI (CERN)

JAVIER DUARTE, SERGO JINDARIANI, BEN KREIS, NHAN TRAN (FNAL)

PHIL HARRIS (MIT)

ZHENBIN WU (UIC)

+ EJ KREINAR (HAWKEYE 360) AND SONG HAN (GOOGLE/STANFORD)

## MACHINE LEARNING IN FPGAS

### Many parts of the trigger could benefit from machine learning

clustering, fitting (regression), classification, anomaly detection

### Not just LHC physics or triggering

DAQ, neutrino physics, intensity frontier, ...

No industry solutions: LHC latency constraints are unheard of

### Why HLS?

HLS allows (super)-fast algorithm development

Write a tool for machine learning *inference*\* at low latencies: HLS4ML

\* for training, GPUs remain top dog

## **NN INFERENCE IN A NUTSHELL**



### Simple 2 input example (Fisher linear discriminant, linear support vector machine,...)

$$O_1 = I_1 \times W_{11} + I_2 \times W_{21} + b_1$$
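The same formula generalizes to a fully connected layer, $O_j = \sum_i I_i W_{ij} + b_j$. The Python sketch below shows that computation; the function names and the ReLU activation are illustrative assumptions, since the slide only gives the linear two-input case.

```python
def dense(inputs, weights, biases):
    """O_j = sum_i I_i * W_ij + b_j; weights[i][j] links input i to output j."""
    return [
        sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j]
        for j in range(len(biases))
    ]


def relu(values):
    """Assumed activation for hidden layers: max(0, x) element-wise."""
    return [max(0.0, v) for v in values]


# The two-input, one-output case from the slide:
# O_1 = I_1 * W_11 + I_2 * W_21 + b_1
o1 = dense([1.0, 2.0], [[0.5], [-0.25]], [0.1])[0]
```

Every output needs one multiplication per input, so a layer with N inputs and M outputs costs N × M multiplications per inference.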

[Figure: network diagram with INPUT 1 and INPUT 2, a fully connected hidden layer, and outputs]

## (ENERGY) EFFICIENT NEURAL NETWORKS

### Emergent engineering field, efficient implementation of NN architecture

### **Compression/Pruning:**

maintain the same performance while removing low weight synapses and neurons (many schemes)

### **Quantization/Approximate math:**

32-bit floating point math is overkill

20-bit, 18-bit, ...? fixed point, integers? binarized NNs?

[Figure label: before pruning]

For further reading, start here: https://arxiv.org/pdf/1510.00149v5.pdf
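Both ideas can be shown in a few lines of Python. This is a sketch of one simple scheme of each kind: magnitude-based pruning and signed fixed-point rounding. The function names, the round-to-nearest and saturation behaviour, and the example widths are assumptions of this sketch; real pruning schemes usually retrain the network afterwards.

```python
def prune_by_magnitude(weights, fraction):
    """Zero out (at least) the given fraction of weights with smallest |w|.

    Ties at the threshold are all dropped, so slightly more than the
    requested fraction may be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    n_drop = int(len(flat) * fraction)
    if n_drop == 0:
        return [list(row) for row in weights]
    threshold = flat[n_drop - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def quantize_fixed_point(x, total_bits=18, integer_bits=8):
    """Round x to a signed fixed-point grid with the given widths.

    integer_bits includes the sign bit; values outside the range saturate.
    """
    scale = 1 << (total_bits - integer_bits)
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, round(x * scale)))
    return q / scale
```

Pruned weights need no multiplier at all, and narrower fixed-point values need smaller multipliers, which is why both reduce FPGA resource usage.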



### PROJECT OVERVIEW






## **HLS4ML - TRANSLATION IN ONE LINE!**

python keras-to-hls.py -c keras-config.yml



**IOType**: parallelize or serialize

**ReuseFactor**: how much to parallelize

**DefaultPrecision**: self-explanatory :)

example-keras-model-files/KERAS\_1layer\_weights.h5





### Keras/TF inputs

## **EXAMPLE: JET SUBSTRUCTURE**

### 5 output multi-classifier: does a jet originate from a quark, gluon, W/Z boson, top quark?

### **Network architecture**

16 expert inputs: jet masses, multiplicity, ECFs ($\beta = 0, 1, 2$)
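As an illustration of what a five-output multi-classifier returns, the sketch below turns five raw network scores into class probabilities. The softmax output layer is an assumption (the slide does not name the output activation), and so is the label order: the slide lists quark, gluon, W/Z boson and top quark, and treating W and Z as separate classes to reach five outputs is a guess.

```python
import math

# Assumed label order; see the caveats above.
CLASSES = ["quark", "gluon", "W", "Z", "top"]


def softmax(scores):
    """Convert raw scores to probabilities that sum to one."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """Return the most probable label and the full probability vector."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs
```

Downstream trigger logic could either take the most probable label or cut on individual class probabilities.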







## EXAMPLE: NETWORK (NOT JET) PRUNING

Resource usage: 92% DSP usage for Virtex 7

61 clocks (305 ns), Pipeline = 1

29% DSP usage for Virtex 7



### **MINI-SUMMARY**

### HLS4ML

a tool to translate ML algorithms for FPGAs in minutes

highly parallelizable with user controls for resource usage and latency: tunable precision, resource reuse

very efficient network design with model compression

### Work in progress

Mapping out resource usage and latency as a function of neural network hyperparameters

More network architectures: CNN (in progress), RNN/LSTM (tricky!), TMVA BDT (efficient?)

### **Status**

Alpha version: few weeks

Targeting March-April release of Beta version

Please contact us if you are interested! hls4ml.help@gmail.com





### one more fun thing to think about for the high level trigger (and beyond?)

## **#TRENDING**







Specialized co-processor hardware for machine learning inference

Already exists! Example: Microsoft ca...

[Screenshot: translation demo with Data Source: Wikipedia; Translate to: Spanish; Processor Type: Azure FPGA Server; Compute Capacity; Estimated Time; Pages Per Second]

Translation of all of Wikipedia **~O(100) times faster than** ...

INTEL® FPGA ACCELERATION HUB

The Intel® Xeon® Acceleration Stack for FPGAs is a robust framework enabling data center applications to leverage an FPGA's potential to increase





## ML BABEL FISH



a common *language* for solving problems, which can universally be expressed on optimized computing hardware and follow industry trends

Large gains from hardware accelerating co-processors

Industry trending towards specialized computing paradigms





## SUMMARY AND OUTLOOK

Recent advances in hardware and compilation/synthesis allow for sophisticated techniques at low latency

Big improvements in performance, preserve soft and hidden signatures

Proof-of-concept holistic pileup mitigation techniques such as PUPPI

Efficient machine learning at Level-1 Trigger

New paradigms for HLT and offline?


## BONUS



### **ReuseFactor**: how much to parallelize operations in a hidden layer

[Figure: # of multiplications per clock (DSP usage) vs. time (decreasing throughput)]
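The trade-off can be made concrete with a simplified cost model in Python. It assumes that every multiplication maps to one hardware multiplier and that a reuse factor R means each multiplier is used R times per inference; the real DSP count also depends on precision and on how the synthesis tool maps the operations, so treat the numbers as illustrative only.

```python
import math


def layer_cost(n_in, n_out, reuse_factor):
    """Rough multiplier count and input spacing for one dense layer."""
    n_mult = n_in * n_out
    return {
        "multiplications": n_mult,
        # Each multiplier is time-shared reuse_factor times.
        "multipliers": math.ceil(n_mult / reuse_factor),
        # A new input can only be accepted once the multipliers are free.
        "clocks_per_input": reuse_factor,
    }


# Hypothetical 16-input, 64-neuron layer scanned over reuse factors.
scan = {r: layer_cost(16, 64, r) for r in (1, 2, 4, 8)}
```

In this model, doubling the reuse factor halves the number of multipliers and doubles the number of clocks between accepted inputs.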






## EXAMPLE: QUANTIZATION

### Take a simple 1-layer network and scan in input/weight precision

Reduced precision can greatly reduce resource usage

e.g. factor of 4 reduction with 18 instead of 32 bits with minimal loss in performance
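A precision scan of this kind can be mimicked in Python for a single neuron. The sketch below quantizes inputs, weights and bias to several bit widths and records how far the output moves from the floating-point result. The example values, the choice of 6 integer bits, and the round-and-saturate behaviour are assumptions; it measures numerical error only, not FPGA resources.

```python
def to_fixed(x, total_bits, integer_bits=6):
    """Round x onto a signed fixed-point grid and saturate at the range limits."""
    scale = 1 << (total_bits - integer_bits)
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, round(x * scale))) / scale


def neuron(inputs, weights, bias):
    """Single linear neuron: weighted sum plus bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def scan_precision(inputs, weights, bias, bit_widths, integer_bits=6):
    """Absolute output error versus total bit width."""
    exact = neuron(inputs, weights, bias)
    errors = {}
    for bits in bit_widths:
        q_in = [to_fixed(x, bits, integer_bits) for x in inputs]
        q_w = [to_fixed(w, bits, integer_bits) for w in weights]
        q_b = to_fixed(bias, bits, integer_bits)
        errors[bits] = abs(neuron(q_in, q_w, q_b) - exact)
    return errors


# Hypothetical values for illustration.
errors = scan_precision([0.3, -1.2, 2.5], [0.7, 0.15, -0.4], 0.05, [8, 18, 32])
```

For these example values the 18-bit result is already very close to the 32-bit one, while 8 bits is visibly off.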





### UNDER THE HOOD



| Resource        | DSP  | FF    | LUT  |
|-----------------|------|-------|------|
| Used            | 3329 | 95924 | 8127 |
| Utilization (%) | 92   | 11    | 18   |

## THE COMPUTING CHALLENGE



### Major HLT and computing challenges going forward!

### **Current:** ~5 minutes per **HL-LHC** event

**100 times the** data... exabytes!



## MOORE'S LAW AND DENNARD SCALING



Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten Dotted line extrapolations by C. Moore

### Single threaded performance not improving

Circa 2005: "The Era of Multicore" → Today: Transition to the "Era of Specialization"? (c.f. Doug Burger)

CPU: MULTIPLE CORES

GPU: THOUSANDS OF CORES

## ARCHITECTURES



Source: Bob Brodersen, Berkeley Wireless group

\* GPUs still best option for training
\* FPGAs generally much more power efficient

