IML meeting: July 5, 2016
Peak people on Vidyo: 24
Peak people in the room: 25

Lorenzo: Intro and news
- Today's meeting is dedicated to anomaly detection
- Particularly important for monitoring data quality in experiments
- Next meeting is August 25th, focusing on unsupervised learning
- Link to the next meeting's agenda is in the slides

James: Anomaly detection in ATLAS and RAMP at HSF
- Different people have different views of anomaly detection
  - Automatic detection of events which are somehow different from the bulk of the data
  - These events are therefore worth the attention/scrutiny of experts
- Can be supervised: train to recognize specific anomalous cases
- Can be semi-supervised: train on the bulk without anomalies; strongly related to one-class classification
- Can be unsupervised: automatically identify the bulk by some means and thus identify anomalies
- Three kinds of anomalies
  - Point: a few points outside of the main bulk
  - Contextual: the point at which a given value is observed is anomalous, though the value itself could otherwise have been sensible
  - Collective: population-level differences
- Useful for monitoring and detection of problems in several areas (DAQ/trigger, distributed computing, reconstruction and data quality)
- In physics analysis, useful to look for unusual events (point anomalies) or collective behavior
- Suppose you have two samples that are supposed to be statistically identical
  - Two MC samples, two different data runs, etc.
  - How can you verify that A and B are identical?
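The semi-supervised (one-class) setting described above can be sketched with scikit-learn: fit a one-class classifier on clean "bulk" data only, then flag new points it considers outliers. This is a minimal toy illustration, not any experiment's actual pipeline; all data and parameters here are made up.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# The "bulk": points from a standard 2-D Gaussian, assumed free of anomalies
bulk = rng.normal(0.0, 1.0, size=(500, 2))

# Semi-supervised setting: the classifier only ever sees the clean bulk
clf = OneClassSVM(nu=0.05, gamma="scale").fit(bulk)

# New data: mostly bulk-like points, plus a few far-away point anomalies
normal_new = rng.normal(0.0, 1.0, size=(20, 2))
anomalies = rng.normal(6.0, 0.5, size=(5, 2))

# predict() returns +1 for inliers and -1 for outliers
pred_normal = clf.predict(normal_new)
pred_anom = clf.predict(anomalies)
print((pred_normal == 1).mean(), (pred_anom == -1).mean())
```

The `decision_function` of such a model also gives a continuous anomaly score, which is the "natural quantification of the degree of abnormality" mentioned later in the DQ discussion.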
- Standard approach: overlay histograms of specific variables and look for differences
- ML approach: train a classifier to distinguish A from B, histogram the classifier score, and check that distribution for differences
  - Results in one distribution to check rather than thousands
  - If the classifier is able to separate A and B, then there must be a difference between the samples
- RAMP = Rapid Analytics and Model Prototyping
  - Real-time data challenge; idea developed by Paris-Saclay
  - Participants in the same room, or at least working in real time
  - Once submitted, code can be viewed and cloned by others (competitive-collaborative environment)
  - Makes use of IPython and Jupyter notebooks
- Anomaly detection RAMP at HSF
  - Around 30 participants
  - "Reference dataset" was a subset of the HiggsML dataset
  - "Distorted dataset" was a distorted version of a different subset of the HiggsML dataset
  - Chose the area under the ROC curve (AUC) as the performance metric
  - Leaderboard shows the progression, with a clear benefit from a new variable that was then picked up by other participants
  - RAMP-style competitions can be very productive and lead to rapid developments
- Switching topic, ATLAS activities fall into three main areas
  - DAQ: contextual anomaly detection in a time series
  - Distributed computing
  - Data quality monitoring and physics analysis (one-class classification)
- DAQ: a NARX neural network is trained to predict a corridor within which the next point in a time series should fall
- Distributed computing: at any given time, something is going to be down
  - Traditional approach: keep re-trying jobs until they work (inefficient, unpredictable delays)
  - ML could be applied to guide the application of novel fault-tolerance strategies
  - Determine when a retry is needed, and when it's just not going to work and the job should be moved to another site
  - Could significantly reduce turnaround times for production
  - Joint WLCG demonstrator project with LHCb
- DQ: two datasets that should be statistically compatible, but are they?
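The classifier-based two-sample comparison described above can be sketched as follows: label sample A as 0 and sample B as 1, train any classifier, and compute the AUC on held-out data. An AUC consistent with 0.5 means the classifier cannot tell the samples apart; a significantly larger AUC flags a difference. This is a toy sketch (synthetic Gaussian data, an arbitrary classifier choice), not the RAMP or ATLAS code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Sample A (reference) and sample B, identical except one shifted variable
A = rng.normal(0.0, 1.0, size=(2000, 5))
B = rng.normal(0.0, 1.0, size=(2000, 5))
B[:, 0] += 1.0  # the "distortion" hiding in one of many variables

# Label the samples and train a classifier to distinguish them
X = np.vstack([A, B])
y = np.concatenate([np.zeros(len(A)), np.ones(len(B))])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# AUC ~ 0.5: samples indistinguishable; AUC well above 0.5: they differ
print(round(auc, 3))
```

The benefit is exactly the one noted in the talk: instead of eyeballing thousands of overlaid histograms, one checks a single score distribution (or a single AUC number).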
- First approach is similar to what was done in the RAMP
- Second approach is a one-class classifier that only sees the reference data (semi-supervised)
  - Provides a natural quantification of the degree of abnormality
  - An autoencoder was tested as an example of a simple one-class classifier
- In the future, would also like to investigate use in early warning systems (strange events being picked out as they are reconstructed)
- Question (room): on the NARX neural network, how is this related to RNNs?
  - James: I believe it's very similar; not the same, but strongly related
- Question (Sergei): which software do you use for the simple NN and the autoencoder?
  - James: scikit-learn and the scikit-learn add-on sknn
- Question (Steven): what about expected changes between runs, such as different pileup levels?
  - James: would need to re-train with a new reference; no simple way around it
  - Steven: what about using something like a time series, but a pileup series instead?
  - James: may be possible, but likely not enough time steps to do this reliably

Viktor: Anomaly detection in CMS data quality monitoring
- Future upgrades to the detectors and the LHC will make data classification/certification a much bigger problem
  - Significantly increased amount of data, but the same number of people to look at it
- Focus on the CMS hadronic calorimeter (HCAL); the number of channels is going to double soon
- Method shown in the last IML meeting on clusterization via statistical moments
  - Applied the method as described, focusing on the 1st and 2nd moments of the significance distribution
  - Selected a few runs that were previously identified as either good or bad
  - Bad runs were classified correctly using these variables
- Able to spot problems in HCAL occupancy distributions, normally done by shifters looking at plots
- Also able to identify problematic run timing distributions
- However, the clusterization is luminosity dependent
  - Seen clearly when scaling up to include more runs: the low-luminosity runs and very short runs are outliers
- In the future, if this is to be used for classification, it must be done lumi-section by lumi-section (not run by run)
- Should also be split by HCAL sub-detector, not just the full detector (later: extend to all CMS systems)
- Comment (Viktor): curious to see what variables were used by ATLAS in the previous talk
  - James: the plots shown were just toy examples, not yet at the point of selecting variables
- Question (Lorenzo): you have a dependence on luminosity, is that correct?
  - Viktor: occupancy is different at different luminosities
- Question (room): did you base your results on histograms, or on clustering algorithms?
  - Viktor: two histograms, one reference, one you are trying to classify
  - Calculate a significance of the difference between the histograms
  - Then look whether you can see a difference using that
- Question (Steven): likely improved separation if you split by lumi-sections or similar to reduce the pileup/etc. dependence
  - Viktor: yes, need to look by lumi-section

Maxim: Anomaly Detection and Yandex
- Will cover both supervised anomaly detection at CMS and unsupervised anomaly detection at LHCb, starting with CMS
- Goal is to let experts deal only with non-trivial cases
  - System learns to predict the expert's response from "good" vs "bad" labelled data
  - Most obvious cases are covered automatically; ambiguous ones are left to the experts
  - System continuously learns from the experts
- Divides data into three groups: almost surely good, almost surely bad, and ambiguous
- Three performance metrics: rejection rate, pollution rate, and loss rate
- Sequential learning procedure defined in eight steps (slide 14)
- Can save a lot of time with minimal or even zero loss/pollution rates
  - Feasible to save 50-85% of the manual work under reasonable constraints
- However, the expert decision was chosen as the reference
  - Experts also have some intrinsic loss and pollution rates
- Future work:
  - Additional features, increased robustness
  - Replace "good" and "bad" with specifications of the problematic sub-detector
  - Studies were done with 2010 data; need to be updated
- Moving to LHCb studies of unsupervised learning
  - Monitoring different trigger streams
  - Try to separate different runs
  - Project is in its initial phase, but a correlation between reported problems and classifier quality has been observed
- Question (Sergei): would be interesting for those doing DQ monitoring to keep track of the mis-classification rate
  - How often does the shifter do something wrong?
  - Once they have input from these methods, watch whether this rate improves
- Question (Viktor): how do you select the features to use?
  - Maxim: brief summary of the feature-extraction procedure on slide 7
  - Some fixed-size features are extracted for each event
  - Results in more than 1000 features; believe that each feature can provide increased quality
- Question (Sergei): so nothing here is detector-level, these are more physics observables?
  - Maxim: yes, correct; the question is whether we can identify defects using physical features

Andrew: Multiple loss functions in TMVA
- BDTlib: a package focusing on multiple loss functions so the user can focus on the data that matters to them
- Some visible improvements with different loss-function choices that were not previously supported
- Consensus to integrate this package into TMVA; starting to do this now
- Plans to parallelize the BDTs in TMVA
  - Multiple benefits from this (training, evaluation, etc.)
- Question (Steven): when would we expect this to be in TMVA?
  - Andrew: in the next TMVA build, but just the different loss functions (parallelization will come later)
  - Sergei: that means later this summer