IML Machine Learning Working Group: trigger applications of ML

40/S2-C01 - Salle Curie, CERN

46 people in the room
22 on Vidyo
unknown number on the webcast

# NEWS

 - kick-off meeting of the CMS ML forum took place
 - HEP Software Foundation community white paper:
    * large step forward at the workshop in Annecy (https://indico.cern.ch/event/613093/)
    * still subject to editing; the entire community is invited to contribute
    * to be presented to the community at ACAT (3rd week of August)
 - MLHEP summer school next week at Imperial, with a challenge
    * https://indico.cern.ch/event/613571/
 - next Fermilab ML meeting tomorrow
    * https://indico.fnal.gov/conferenceDisplay.py?confId=14818

# anomaly detection (Siraj Raval; data scientist / YouTuber / special guest)
recording on [cds](http://cds.cern.ch/record/2274402)

 - the example is credit card fraud, but the principle is generic and should work just as well for anomaly detection in detector operations
 - works with an autoencoder
   * the autoencoder finds a lower-dimensional space in which to represent the data
   * data samples with large reconstruction errors in training are the outliers of the normal phase space of regular data
   * usually dimensionality reduction is done with e.g. PCA;
     this talk is not so much about deep learning
 - a neural network is used because of its property as a universal function approximator
 - example features for credit card data
   * when logged in
   * from where
   * amount of money
 - suggested usages for NN at CERN
   * supervised learning for classification
   * GAN
   * clustering
 - recommendations from the expert
   * use deep learning every day! CERN is perfect
     + CERN has enough DATA
     + CERN has computers
     + add Neural Nets
     => the trinity of deep learning
   * use tensorflow
   * (use pytorch for building new things)
 - scikit-learn for data sample splitting
 - normalise data to roughly the -3 .. 3 range
 - activation function: ReLU "objectively best" at the moment
 - training with gradient descent to minimise loss function as function of weight values
   <explains difference gradient descent vs. stochastic gradient descent>
   <back propagation == gradient descent>
 - the autoencoder uses the input as TARGET = LABEL,
   i.e. F(x) = x is what the network should learn
 - loss function: RMSE,
   i.e. minimise Σ_i (x_i - F(x_i))^2
    * a single scalar for the entire dataset
    * anomalies will in the end have large (x - F(x))^2,
      and conversely data points with large (x - F(x))^2 are assumed to be credit card fraud
   this is essentially a supervised learning process [side remark: unsupervised learning, e.g. with generative models, is a hot field to work in]
   <cost == error == loss>
   (a minimal code sketch of the whole recipe follows below)
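
A minimal sketch of the recipe above, assuming Keras/TensorFlow and scikit-learn as suggested in the talk; the feature matrix, network sizes, and the top-1% anomaly threshold are illustrative placeholders, not values from the presentation:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# placeholder feature matrix (e.g. one row per transaction); replace with real data
X = np.random.rand(10000, 8).astype("float32")

# split with scikit-learn, normalise to roughly the -3 .. 3 range as recommended
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# small autoencoder; the bottleneck is the learned low-dimensional representation
model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(3, activation="relu"),   # bottleneck
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(8),                      # reconstruct the input
])

# TARGET = LABEL = input: the network should learn F(x) = x
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, X_train, epochs=10, batch_size=128,
          validation_data=(X_test, X_test), verbose=0)

# per-sample reconstruction error (x - F(x))^2; large values are anomaly candidates
errors = np.mean((X_test - model.predict(X_test)) ** 2, axis=1)
threshold = np.percentile(errors, 99)   # assumption: flag the top 1% as anomalies
anomalies = X_test[errors > threshold]
```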
 QUESTIONS:
    - what are typical false-positive rates
      = usually a tough problem in unsupervised learning
      = wouldn't put anything into production with less than 95% accuracy
   - how would you recommend dealing with many qualitative features
     = convert qualitative features into something quantitative
     = use qualitative features as class definition, to define a multiclass problem
   - what if anomalies are not really different from "normal" data
     = that's a real problem
     = possible to apply PCA methods (see tomorrow's youtube video)
    - I was missing that you didn't do any singing
      = "i gotta read papers to try and make me smarter, i train my models in the cloud now …
        subscribe if you wanna learn now!"
        (will be released on the weekend)
    - how would you optimise the hyperparameters
      = anything that is better than grid search
      = a general problem and a field of research
      = Bayesian optimisation is the way to go; makes hyperparameters differentiable
      = introduces stochasticity into the model itself
        (a toy Bayesian-optimisation sketch follows at the end of this section)
    - had a problem with network traffic anomaly detection: people were able to make their traffic look normal
      = often adding layers helps, but there is an upper bound
      = but in practice 1-10 layers seem optimal
      = Keras should give an easy frontend for optimising the model
    - suggests working on CERN open data; it would be great if people outside CERN analysed our data and tried their knowledge on it
    - often, when we put lots of work into deep neural networks, it is hard to convince the collaboration that we know what's going on inside the NN. do you have an idea how to address that concern?
     = interpretability is hard and at the edge of research
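
A hedged illustration of the Bayesian-optimisation answer above, assuming the scikit-optimize package; the classifier, search space, and objective are made-up examples, not anything shown in the talk:

```python
from skopt import gp_minimize                      # scikit-optimize
from skopt.space import Integer, Real
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# synthetic stand-in dataset; in practice use your own features and labels
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

def objective(params):
    """Cross-validated loss for one hyperparameter point."""
    n_hidden, lr = params
    clf = MLPClassifier(hidden_layer_sizes=(int(n_hidden),),
                        learning_rate_init=float(lr),
                        max_iter=200, random_state=0)
    return -cross_val_score(clf, X, y, cv=3).mean()

# a Gaussian-process surrogate decides where to evaluate next,
# instead of exhaustively scanning a grid
result = gp_minimize(objective,
                     [Integer(4, 64),                            # hidden width
                      Real(1e-4, 1e-1, prior="log-uniform")],    # learning rate
                     n_calls=20, random_state=0)
print("best hyperparameters:", result.x)
```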


# an EoS meter of the QCD transition from deep learning

 - there are locally connected, convolutional, and fully connected networks
 - convolutional nn successful in image recognition
 - an example application: use a DNN to learn solutions of the Schrödinger equation
 - heavy ion collisions have many model parameters, and matching them to experimental data is complicated
 - Bayesian optimisation can be applied here
 - a CNN should be able to extract high-order correlations from the detected final state and infer the initial state of the collision
 - labelled data from simulated collision
 - distinguish a crossover transition from a 1st-order phase transition using event data
 - 20% dropout, parametrised ReLU activation (toy network sketch after this list)
 - prediction accuracy increases with more training data
 - investigate what the network learned
   * prediction difference analysis (generic occlusion-style sketch after the questions below)
     + by how much a single prediction changes under the exchange of a single pixel
   * importance map for testing data
     + different for different classes
 - network insensitive to initial state energy fluctuations
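
A toy sketch of a network along the lines described above (20% dropout, parametrised ReLU); the 24x24 input shape and the random "spectra" are illustrative assumptions, not the actual simulated event data:

```python
import numpy as np
from tensorflow import keras

# placeholder "events": 2D spectra with two classes (crossover vs. 1st order);
# shapes and content are assumptions for illustration only
X = np.random.rand(2000, 24, 24, 1).astype("float32")
y = np.random.randint(0, 2, size=2000)

model = keras.Sequential([
    keras.Input(shape=(24, 24, 1)),
    keras.layers.Conv2D(16, 3, padding="same"),
    keras.layers.PReLU(),                  # parametrised ReLU, as in the talk
    keras.layers.Dropout(0.2),             # 20% dropout, as in the talk
    keras.layers.Conv2D(32, 3, padding="same"),
    keras.layers.PReLU(),
    keras.layers.Dropout(0.2),
    keras.layers.Flatten(),
    keras.layers.Dense(64),
    keras.layers.PReLU(),
    keras.layers.Dense(2, activation="softmax"),   # two transition classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=64, validation_split=0.2, verbose=0)
```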
QUESTIONS:
  - will you look into real data
    * not available for analysis atm
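
The prediction difference analysis mentioned above can be approximated with a generic occlusion scan: re-evaluate one prediction while exchanging one pixel at a time. This is a sketch of the general idea, not the exact method of the paper:

```python
import numpy as np

def prediction_difference_map(model, image, target_class, baseline=0.0):
    """Importance map: change of one class score when each pixel is exchanged."""
    ref = model.predict(image[None])[0, target_class]
    h, w = image.shape[:2]
    importance = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            perturbed = image.copy()
            perturbed[i, j] = baseline                # exchange a single pixel
            score = model.predict(perturbed[None])[0, target_class]
            importance[i, j] = ref - score            # large drop => important pixel
    return importance
```

Averaging such maps over test events of each class would give the class-dependent importance maps mentioned above.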

# LHCb trigger
 - the trigger is real-time data selection
 - in LHCb upgrade (2020) scenarios, the rate of interesting events may very well be above 1 MHz
   → the trigger will not select a few interesting events from the background; instead, many events are interesting
 - key features of the experimental apparatus w.r.t. ATLAS/CMS
   * forward spectrometer
   * dipole magnet
   * RICH detectors for particle identification
 - will remove the hardware trigger stage because it is the least efficient one
   → read out the entire detector at 30 MHz and send the data to the computer farm (which also serves as a buffer, so the latency requirement is lifted)
 - perform offline reconstruction in the trigger after short buffering for detector alignment and calibration
 - tried ML in the hardware trigger in the past, but with current detector / current hardware trigger not much to be gained
 - currently: essentially half of the trigger rate comes from machine learning algorithms in the high level trigger, and they feed about 2/3 of all physics papers
 - tools from latest optimisation campaign public on https://arogozhnikov.github.io/hep_ml/
 - key concerns for trigger applications:
   * ensure stability
   * how to evaluate it instantaneously
   * feature building no concern at the moment because full reconstruction runs in the trigger
 - solved in 2010 with "bonsai" BDT
   * BDT is effectively a binned selection (in some n-dim feature space with variable bin size)
   * pick bin size yourself with bins larger than detector resolution
     -> few bins (feasible lookup table)
     -> expected to be stable over the year
 - the BDT lookup table sits in shared memory, so its 50 MB size is no concern in a multithreaded application
 - a setup to create bonsai BDTs and lookup tables is available and usable for newcomers (a toy sketch of the binning + lookup-table idea follows below)
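
A toy sketch of the bonsai idea under stated assumptions: hand-chosen coarse bin edges, a generic scikit-learn BDT standing in for the production classifier (the real tools live in the hep_ml package linked above), and the trained classifier evaluated once per bin centre to fill the lookup table:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# toy training data: two features, signal vs. background labels
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = GradientBoostingClassifier().fit(X, y)

# hand-picked bin edges, coarser than the detector resolution
edges = [np.linspace(-3, 3, 9), np.linspace(-3, 3, 9)]    # 8 bins per feature
centers = [0.5 * (e[:-1] + e[1:]) for e in edges]

# evaluate the BDT once per bin centre; the table then replaces the BDT
grid = np.array(np.meshgrid(*centers, indexing="ij"))
lut = clf.predict_proba(grid.reshape(2, -1).T)[:, 1].reshape(8, 8)

def bbdt_response(x):
    """Trigger-side evaluation: a single read from the precomputed table."""
    i = np.clip(np.digitize(x[0], edges[0]) - 1, 0, 7)
    j = np.clip(np.digitize(x[1], edges[1]) - 1, 0, 7)
    return lut[i, j]

print(bbdt_response([1.0, 0.5]))
```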
 - that was 2010; since then, lots of profit from the collaboration with YSDA
 - more signal modes have been included since, but the principle of pruning down to a simple classifier stays
 - another concern is whether a classifier biases the analysis through its dependence on the kinematics
   (similar to pivoting with adversarial networks; already in physics publications in 2015, and now in the trigger and particle identification; a decorrelation sketch follows below)
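
One way to sketch such a decorrelation is the flatness loss from the hep_ml package linked above (whether this exact tool is what runs in the trigger is not stated in the talk); the data and the "mass" variable here are synthetic placeholders:

```python
import numpy as np
import pandas as pd
from hep_ml.losses import BinFlatnessLossFunction
from hep_ml.gradientboosting import UGradientBoostingClassifier

# toy data: two training features plus a kinematic variable ("mass") that the
# classifier response should NOT sculpt; names and values are made up
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({"f1": rng.normal(size=n),
                   "f2": rng.normal(size=n),
                   "mass": rng.uniform(0, 10, size=n)})
labels = (df["f1"] + df["f2"] > 0).astype(int)

# the flatness term penalises dependence of the response on "mass"
# for the background class (uniform_label=0)
loss = BinFlatnessLossFunction(uniform_features=["mass"], uniform_label=0)
clf = UGradientBoostingClassifier(loss=loss, train_features=["f1", "f2"],
                                  n_estimators=50)
clf.fit(df, labels)   # the frame must also contain the "mass" column
```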

 - tracking
   * what is costly is the combinatorics with bad candidates
   * now a DNN picks the right seeds before the tracking is run
   * in the offline reconstruction and software trigger since 2016
 - finally, another NN picks the final track candidates after the track fit, a significant improvement over the Kalman filter fit alone

 - demonstration of what real-time data analysis means:
   the dimuon spectrum straight from the trigger (only di-muon pairs stored for analysis);
   200M candidates from the full 2016 data sample, a pure selection thanks to the NN for PID combining all detectors

 - by now, detector performance stability is no concern anymore, as the calibration runs fully automatically in real time
 - e.g. the refractive index calibration runs hourly
 - updating the alignment could take weeks in 2010; it is now down to 8 minutes

 - high LHC cross sections have somewhat helped to get ML adopted "everywhere", because there was no other way
 - in run 2, ML assists classical feature building everywhere
 - for run 3, the work is on "replacing feature building by ML"

 - run 3 will go from 50 GB/s → 5 TB/s

 - QUESTIONS
   * what are the challenges
      + will we be able to do the tracking (essentially a Kalman filter) within the budget
     + can ML help (identify tracks before fitting? image recognition in the rich?)
     + combinatorics (build all possible 4-body combinations from hundreds of tracks in the event)
       → reducible with powerful particle identification
       → can parallelisation do this efficiently
       → can ML identify combinations

# Andrew: application of ML in the L1 trigger of CMS
(cf. previous presentations by Andrew on 23rd Nov https://indico.cern.ch/event/571105)

 - lots of low-pT muons mean a large rate increase if you cut too loosely
 - want to cut accurately in the trigger so as not to cut too tightly into the physics
 → want an accurate pT assignment, implemented in hardware
 - similar to the previous talk:
    * discretise features
    * create a lookup table (2 GB)
    * difference: pT regression instead of signal/background classification (toy sketch below)
 - implemented/deployed for 2016/17
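
The same discretise-and-tabulate scheme, sketched for regression; the features, bit widths, and regressor below are illustrative assumptions, not the actual muon trigger primitives:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# toy inputs: discretised delta-eta / delta-phi words; the bit widths and the
# synthetic pT relation below are placeholders, not real trigger quantities
rng = np.random.default_rng(0)
deta = rng.integers(0, 32, size=20000)     # 5-bit word
dphi = rng.integers(0, 64, size=20000)     # 6-bit word
pt = 100.0 / (1.0 + dphi + 0.5 * deta) + rng.normal(scale=0.1, size=20000)

reg = GradientBoostingRegressor().fit(np.column_stack([deta, dphi]), pt)

# precompute the regression for every possible input word; with ~10
# discretised features this is what grows into the 2 GB table
words = np.array(np.meshgrid(np.arange(32), np.arange(64),
                             indexing="ij")).reshape(2, -1).T
lut = reg.predict(words).reshape(32, 64)

def assign_pt(deta_word, dphi_word):
    """Firmware-side pT assignment: a single table read, no floating point."""
    return lut[deta_word, dphi_word]

print(assign_pt(3, 17))
```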
 - QUESTIONS
    - what are the inputs?
       Δη, Δφ, mostly pre-calculated / easy to calculate
    - which FPGA can hold 2GB?
      use another board for the table
     - how about floating point operations? these would need the FPGA's DSP blocks
       no FP operations, everything precalculated
    - stability wrt. pile up?
    - how many features are actually used
      less than 10
     - did you try pruning, to reduce the number of features with acceptable performance loss?
       indeed picked only the most important ones,
       and monitor changes of the pT spectrum under addition or removal of features
       (toy pruning sketch below)
     - happy to see that the LHCb approach of discretisation is also used by CMS, with success
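
A generic sketch of the pruning procedure described in the last answers: rank features by importance, then retrain with fewer features and monitor the performance loss (feature names and data are synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the trigger features
X, y = make_classification(n_samples=5000, n_features=20, n_informative=6,
                           random_state=0)

# rank features by importance from a first training
ranking = np.argsort(
    GradientBoostingClassifier().fit(X, y).feature_importances_)[::-1]

# retrain with progressively fewer features and watch the performance loss
for k in (20, 10, 5):
    score = cross_val_score(GradientBoostingClassifier(),
                            X[:, ranking[:k]], y, cv=3).mean()
    print(f"{k:2d} features: CV accuracy {score:.3f}")
```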
