# News

 * Today's meeting on Multi-Class and Multi-Objective problems
 * Next meeting on Parallelisation (and possibly other topics) of Machine learning methods
                   Friday 24 Feb 2017
                   Salle Curie (40-S2-C01)
 * In March there'll be an IML workshop https://indico.cern.ch/event/595059/
   20-22 March 2017
   the program will be:
    *  tutorials (TMVA, scikit-learn, R, keras)
    *  HSF community white paper discussion on Software and Tools
    *  Identification and Tagging of Physics Objects
    *  Hands-on session (benchmark dataset)
    *  ML tools from industry
 *  other events
    *  HEP Software Foundation workshop next week in San Diego
    *  DS@HEP workshop 8-12 May 2017, FNAL

Stefan Wunsch, Multi-Class Classification Methodology and Application in HEP

 *   TMVA: few methods were available for multiclass classification in the early days of TMVA; the situation is much more satisfactory now
    *   weak point: Fisher LDA is still missing for multiclass problems
 *   Choice of the working point
    *   Typically, in binary classification one first trains and then selects the working point based on a ROC curve
    *   This is not possible in multiclass (more than 1 degree of freedom)
    *   Solution: choose the working point at the training stage using event weights (e.g. high signal efficiency vs high background rejection)
    *   In the case of TMVA's genetic fitter cuts, the working point is incorporated directly in the loss function
 *   Evaluation metrics
    *   Global accuracy often not sufficient. Confusion matrix (aka migration matrix) more relevant
    *   There are various possible representations of the migration matrix; the two most common ones are:
       *   purity representation (listing sample purities by
           normalising to classified classes)
       *   efficiency representation (listing selection efficiencies by
           normalising to true classes)
 *   Application: event classification (H → τ τ vs several BG channels)
    *   Slide 16: compares the workflow of a standard analysis with that of an MVA analysis
    *   Slide 17: shows an example of mass distributions enriched in specific BG processes by an MVA cut
    *   Slide 18: compares results from the traditional analysis and the MVA analysis
 *   Questions
    *   Francesco: how sensitive are the results to the training strategy? How sensitive is the method to the final training uncertainty?
       *   For this problem, MC matches the data pretty well
       *   Uncertainties related to the possible selection of extreme regions in the phase space will be included in the study of uncertainties
    *   Sergei: Tools: what did you use?
       *   TMVA
    *   Michele: A lot of the signal is killed by MVA, so final uncertainty will depend on how well you control the systematics on the background.
       *   Yes, but this is what we want: at the end we are still enhancing the significance
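
The two migration-matrix representations mentioned under "Evaluation metrics" differ only in the normalisation axis. The following is a minimal sketch in plain Python, not the code used in the talk; the function names, the three toy classes and the toy labels are all assumptions made for illustration.

```python
def confusion_counts(true_labels, predicted_labels, n_classes):
    """Return counts[t][p]: number of events of true class t classified as p."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    return counts


def efficiency_matrix(counts):
    """Normalise each row (true class) to 1: selection efficiencies."""
    result = []
    for row in counts:
        total = sum(row)
        result.append([c / total if total else 0.0 for c in row])
    return result


def purity_matrix(counts):
    """Normalise each column (classified class) to 1: sample purities."""
    n = len(counts)
    col_totals = [sum(counts[t][p] for t in range(n)) for p in range(n)]
    return [
        [counts[t][p] / col_totals[p] if col_totals[p] else 0.0 for p in range(n)]
        for t in range(n)
    ]


# Toy labels (hypothetical): 0 = signal, 1 = background A, 2 = background B
true_labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted_labels = [0, 0, 0, 1, 1, 1, 0, 2, 2, 1]

counts = confusion_counts(true_labels, predicted_labels, n_classes=3)
efficiency = efficiency_matrix(counts)  # rows sum to 1
purity = purity_matrix(counts)          # columns sum to 1
```

In this toy, class 1 is selected with efficiency 2/3 while the sample classified as class 1 has purity 1/2, which is why the two representations are usually quoted side by side.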

Tatiana Likhomanenko, Multiclass classification application: LHCb Particle Identification

 *   ML used to assign particle identity to tracks (Ghost, Electron, Muon, Pion, Kaon, Proton)
 *   Comparison of binary and multiclass methods discussed in the slides
 *   Quality measures
    *   ROC curve built for one class vs all the others (there will be multiple ROC curves)
    *   Area under the curve can be used to build a matrix to compare discrimination of different classes
 *   Neural networks for multiclass classification
    *   Complexity of NN is almost the same for multiclass and binary classification
    *   A 3-layer NN built with Keras is used
    *   DN improves results wrt one-vs-rest binary classification
       *   Multiclass uses the full information and provides probabilistic interpretation
    *   A topology stacking sub-networks dedicated to different sub-detectors is discussed
    *   DL needs millions of events for training
 *   BDTs for multiclass
    *   Need to add linear combination of initial features
    *   One vs Rest and multiclass are compared
    *   Performance similar to NN
 *   Flat models
    *   PID efficiency in real life depends on pt
    *   Can be flattened e.g. with uniform boosting BDT
    *   Flattening does not significantly reduce overall performance
    *   Flattening is enforced in the loss function
 *   Questions
    *   How can you train over full pt? (performance and priors change over pt)
       *   uniform boosting (https://arxiv.org/abs/1305.7248 / https://arxiv.org/abs/1410.4140)
           is designed to do this. The training is done over the full pt range, and the loss function
           penalizes efficiency variations as a function of pt (the background rejection, and thus the
           discrimination, in a pt slice can still vary)
    *   Sergei: did you compare the deep multi-class with shallow multi-class vs single class?
       *   Answer: yes
    *   Sergei: how much of the performance boost is due to depth?
       *   Answer: non-negligible
    *   Batch normalization and drop-out: did you use them at the same time?
       *   Yes, these are standard tricks
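
The one-vs-rest quality measures listed above can be sketched as follows. This is an illustrative plain-Python version, not the code from the talk: the AUC is computed from its rank definition (probability that a positive event scores higher than a negative one, ties counting 1/2), and the pairwise matrix convention (row class i separated from column class j using output i) is an assumption, as are all names and the toy numbers.

```python
def roc_auc(scores, is_positive):
    """Area under the ROC curve from pairwise score comparisons."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(probabilities, true_labels, n_classes):
    """AUC of class k against all other classes, for each k."""
    aucs = []
    for k in range(n_classes):
        scores = [row[k] for row in probabilities]
        is_k = [label == k for label in true_labels]
        aucs.append(roc_auc(scores, is_k))
    return aucs


def pairwise_auc_matrix(probabilities, true_labels, n_classes):
    """matrix[i][j]: AUC for separating class i from class j with output i."""
    matrix = [[None] * n_classes for _ in range(n_classes)]
    for i in range(n_classes):
        for j in range(n_classes):
            if i == j:
                continue
            pairs = [
                (row[i], label == i)
                for row, label in zip(probabilities, true_labels)
                if label in (i, j)
            ]
            matrix[i][j] = roc_auc([s for s, _ in pairs], [f for _, f in pairs])
    return matrix


# Toy classifier outputs for three hypothetical classes, two events each
probabilities = [
    [0.7, 0.2, 0.1], [0.5, 0.3, 0.2],  # true class 0
    [0.2, 0.6, 0.2], [0.6, 0.3, 0.1],  # true class 1
    [0.1, 0.2, 0.7], [0.2, 0.3, 0.5],  # true class 2
]
true_labels = [0, 0, 1, 1, 2, 2]

ovr_aucs = one_vs_rest_auc(probabilities, true_labels, n_classes=3)
auc_matrix = pairwise_auc_matrix(probabilities, true_labels, n_classes=3)
```

The pairwise loop is quadratic in the number of events, so for realistic sample sizes a sorted-rank implementation from a library would be the practical choice.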

Sergei Gleyzer, Multiobjective regression methods and applications in HEP

 *   A short review of classification and regression within ML is presented
 *   For regression the optimization criterion is minimum variance
 *   Example to estimate photon energy is discussed
 *   Multiobjective regression has to take into account target correlations: several single-target models are not optimal for this problem.
 *   Several methods can be applied to these problems. Trade offs between accuracy, model size, interpretability to be taken into account
 *   Simple example discussed, shows that correlations can be preserved
 *   Several potential applications, e.g. fast non-parametric simulation
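
The point about target correlations can be illustrated with a toy check. This is not the method presented in the talk: the sketch only shows that predictions reproducing each target's values separately can still lose the event-by-event correlation. All numbers and names are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy truth: two strongly correlated targets per event
target_a = [1.0, 2.0, 3.0, 4.0, 5.0]
target_b = [2.1, 3.9, 6.2, 8.0, 9.8]

# Hypothetical joint model: keeps the event-by-event pairing of the targets
joint_a = [1.1, 1.9, 3.0, 4.2, 4.8]
joint_b = [2.0, 4.0, 6.0, 8.1, 9.9]

# Hypothetical independent models: the same set of values for target b,
# but the pairing with target a is lost
indep_a = joint_a
indep_b = [6.0, 9.9, 2.0, 4.0, 8.1]

corr_truth = pearson(target_a, target_b)
corr_joint = pearson(joint_a, joint_b)
corr_indep = pearson(indep_a, indep_b)
```

Here `corr_truth` and `corr_joint` are both close to 1, while `corr_indep` is close to 0 even though `indep_b` contains exactly the same values as `joint_b`.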

Alexander Radovic, Multi-Class Classification in NOvA

 *   To study neutrino oscillations, one needs to identify the incoming neutrino and the type of process
 *   In NOvA, the 3 neutrino flavors are relevant
 *   The relevant event topologies are reviewed in the slides
 *   The inputs to the method are pairs of images (X and Y view), analyzed with a deep NN
 *   Technicalities
    *   13 target classes
    *   4.7 M simulated events used (80/20 share for training and testing)
    *   Energy compressed to 8 bit
 *   A deep CNN architecture is used, no sign of overtraining seen
 *   Confusion matrix and t-SNE are shown
 *   t-SNE shows a slight difference between truth and reconstructed labels, indicating where the model has difficulty distinguishing different classes
 *   Questions
    *   What is used for visualization? scikit-learn
    *   Sergei: what are the exact inputs? (literal vision problem or a variant)
        * Answer: variant
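
The minutes only record that the energy is compressed to 8 bit; the mapping itself was not noted. As a rough sketch of what such a compression could look like, the example below assumes a linear scale with saturation at a hypothetical `MAX_ENERGY` (arbitrary units). The function names and values are illustrative and not taken from the talk.

```python
def compress_to_8bit(energies, max_energy):
    """Linearly map energies in [0, max_energy] to integer codes 0..255.

    Values outside the range saturate at 0 or 255 (assumed behaviour).
    """
    codes = []
    for energy in energies:
        clipped = min(max(energy, 0.0), max_energy)
        codes.append(int(round(255 * clipped / max_energy)))
    return codes


def decompress_from_8bit(codes, max_energy):
    """Map 8-bit codes back to approximate energies."""
    return [code * max_energy / 255 for code in codes]


MAX_ENERGY = 4.0  # hypothetical saturation value, arbitrary units
energies = [0.0, 0.5, 1.0, 4.0, 10.0]

codes = compress_to_8bit(energies, MAX_ENERGY)
restored = decompress_from_8bit(codes, MAX_ENERGY)
```

With a linear scale, the rounding error for in-range values is at most half a code step, i.e. `MAX_ENERGY / 510`; anything above `MAX_ENERGY` is clipped to code 255.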