IML meeting: February 3, 2016
Peak people on Vidyo: 23
Peak people in the room: 21

Tim: Intro and news
- Likely another follow-up meeting for more tools contributions in a couple of weeks' time, in the afternoon
  - Trying to find a good compromise for people in different time zones and conflicting afternoon commitments
- Multiple upcoming ML in HEP workshops (two later this month!)
- Actual ML conferences are also coming, but deadlines for submitting abstracts/papers are in the coming weeks

Sergei/Omar: New and upcoming features in TMVA
- Several new features are available, many more are in progress
- Six months ago, came up with a document on the desired future of TMVA - making good progress on these items
  - Slide 2 summarizes what's done, what's in progress, and what's not yet underway
- TMVA, via the new PyMVA and RMVA interfaces, can now link to many external ML packages (xgboost, scikit-learn, etc) - see the sketch after this section
- New DataLoader class allows greater flexibility (currently TTrees, but could be extended to data frames etc)
- Deep learning has been added, currently under final testing/validation
- SVM will be discussed in the following talk by Tom
- Major internal updates are currently in progress
  - Removal of static variables to support parallelization
  - Creation of a lightweight constructor which doesn't save outputs to the ROOT file
  - Separation of the classification/regression classes
- DataLoader is planned to be extended to support .csv, HDF5, JSON, etc
  - If there is a useful format missing from their list (slide 14), please contact them
- Work on integrating with ROOTBooks and the Jupyter platform
- TMVAGui is being updated, e.g. with the ability to visualize multiple datasets
- New TMVA::CrossValidation class
  - Will support parallel execution
  - Optional hyperparameter tuning
- Additional deep learning plugins coming, such as darch to include GPU support
- Parallelization is a major task, with multiple levels of parallelization under discussion
- Memory usage is being revisited, especially for parallel execution or multiple datasets
- No questions
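As a concrete illustration of the new interfaces, here is a minimal sketch of booking a scikit-learn method through PyMVA with the new DataLoader. This assumes a ROOT build with PyMVA enabled; the input file and tree names ("inputs.root", "sig", "bkg") are placeholders, and the booking options follow the ROOT PyMVA tutorials rather than anything shown in the talk:

```python
import ROOT
from ROOT import TMVA

TMVA.Tools.Instance()
TMVA.PyMethodBase.PyInitialize()  # required before booking any PyMVA method

data = ROOT.TFile.Open("inputs.root")  # placeholder input file/tree names
sig_tree, bkg_tree = data.Get("sig"), data.Get("bkg")

out = ROOT.TFile("TMVA_PyMVA.root", "RECREATE")
factory = TMVA.Factory("TMVAClassification", out,
                       "!V:AnalysisType=Classification")

# The new DataLoader decouples data handling from the Factory
loader = TMVA.DataLoader("dataset")
loader.AddVariable("var1", "F")
loader.AddVariable("var2", "F")
loader.AddSignalTree(sig_tree)
loader.AddBackgroundTree(bkg_tree)
loader.PrepareTrainingAndTestTree(ROOT.TCut(""), "SplitMode=Random:!V")

# Book scikit-learn's GradientBoostingClassifier via the PyMVA interface
factory.BookMethod(loader, TMVA.Types.kPyGTB, "PyGTB", "!V:NEstimators=100")

factory.TrainAllMethods()
factory.TestAllMethods()
factory.EvaluateAllMethods()
out.Close()
```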
Tom: Development of support vector machines
- Provides a nice introduction to SVMs
  - Soft margin SVM relaxes the hard requirement: points can be on the wrong side, with tunable costs
  - Kernel function choices discussed (standard is the inner product)
  - Please see the slides for the details
- SVMs already existed in TMVA; additional functionality has been added to make them easier to use
  - More kernel functions
  - Hyperparameter optimisation (kernel parameters and cost)
  - Cost weighted to the signal/background dataset size ratio
  - Loss functions implemented but not yet used
- When using SVMs, always use the option "VarTransform=Norm" to properly handle variables of different scales (e.g. pT vs eta)! A minimal booking sketch is given after this section
- Other options have been added/expanded with new features (see slides 11 and 12 for details)
- Example on a checkerboard dataset demonstrates out-of-the-box SVM performance (very similar to BDT for this simple dataset)
- Potential future developments:
  - More documentation
  - Performance contours in parameter space
  - More kernels
  - Optimisations/improvements to the algorithm
- Question (Sergei): SVMs right now are a huge memory hog - 150k events takes 30 GB of memory
  - Reduced SVMs are claimed to improve this - have you tried this?
  - Tom: Not tried yet, but definitely interested in this
- Question (Andreas): the checkerboard example is highly tuned to SVM performance
  - Would be great to follow up with more HEP-like examples with more non-linear correlations
  - Also would be good to look into comparing the current TMVA SVM with libSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/)
    - Use it as a benchmark to see if performance is similar
  - Lorenzo: This library is actually accessible from TMVA now via the RMVA interface, so a comparison should be straightforward
- Question (Pietro): in the current implementation, can the SVM be used with bagging techniques?
  - Andreas: Yes, it should work, but it hasn't been used much so far
- Question (Andreas): the SVM depends crucially on the cost parameter
  - Tom: yes, huge dependence here
    - Out of the box, the cost might be too strict (so training doesn't complete properly) or too loose (allowing too many points on the wrong side)
    - Definitely one of the main things that needs to be optimized - part of the Optimize method to address this
  - Andreas: would be good to make this part of the default SVM method in TMVA rather than only via an option
- Comment: Good to compare with SVM-hint, to be presented in part 2 of this meeting (another day)
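Here is a minimal sketch of booking the TMVA SVM with the recommended "VarTransform=Norm", again assuming the DataLoader-based API; the kernel and cost settings (Gamma, C) are illustrative placeholders to be tuned, not recommendations from the talk, and the input file and tree names are again placeholders:

```python
import ROOT
from ROOT import TMVA

TMVA.Tools.Instance()

out = ROOT.TFile("TMVA_SVM.root", "RECREATE")
factory = TMVA.Factory("TMVAClassification", out,
                       "!V:AnalysisType=Classification")

loader = TMVA.DataLoader("dataset")
loader.AddVariable("pt", "F")   # variables with very different scales...
loader.AddVariable("eta", "F")  # ...are exactly why VarTransform=Norm matters
data = ROOT.TFile.Open("inputs.root")  # placeholder input file
loader.AddSignalTree(data.Get("sig"))
loader.AddBackgroundTree(data.Get("bkg"))
loader.PrepareTrainingAndTestTree(ROOT.TCut(""), "SplitMode=Random:!V")

# Gamma (RBF kernel width) and C (cost) are illustrative; the cost in
# particular should be tuned, e.g. via the Optimize machinery mentioned above
factory.BookMethod(loader, TMVA.Types.kSVM, "SVM",
                   "Gamma=0.25:C=1.0:Tol=0.001:VarTransform=Norm")

factory.TrainAllMethods()
factory.TestAllMethods()
factory.EvaluateAllMethods()
out.Close()
```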
Iurii: Scikit-learn to TMVA - XML converter tool
- Found that the ROC curve for the BDT was below the standard cuts for both electron and muon results
- Decided to try a different MVA library: scikit-learn
  - The same approach now has ROC curves above the standard-cut performance
  - May be down to how they are using TMVA, maybe not, but happy with the scikit-learn results
- However, no sklearn is available in the ATLAS software
  - Solution: convert the classifier from sklearn to the XML format readable by TMVA
- Made the skTMVA package to convert sklearn outputs to TMVA XML files
  - Available on github: https://github.com/yuraic/koza4ok
  - Currently supports BDT binary classification, with AdaBoost and Gradient Boosting
  - Slide 10 gives a code usage example (see also the sketch at the end of these minutes)
  - Example demonstrates that sklearn and skTMVA have identical performance (model correctly transcribed)
- Future plan: convert a scikit-learn model to a standalone C++ file
  - Thinking of doing so in a TMVA-like way
- Question (Sergei): any plans to expand this to other input formats converted to TMVA?
  - Would be nice to have this, and if the community would support it, more likely to work on this
  - Not the primary task, this is on the side, but would like to contribute if there is demand
- Comment (Sergei): the TMVA problems were likely either the tool being used wrong, or a problem with the ROC calculation code
  - A new ROC calculation class has been added to TMVA to fix this
  - Also noticed that how the ROC is calculated in sklearn and TMVA differs
  - Would be interesting to see a direct, fair comparison
- Comment (Dan): technical issue - the current TMVA format couldn't handle the desired deep neural network format
  - Ended up writing their own class
  - Would be good to have TMVA come up with a way to provide enough flexibility that external packages can be used for training and model building, but evaluation is done in TMVA
  - Sergei: The file format is changing, which will help, but they are thinking about deeper changes
  - Marie: The problems we encountered are documented in a previous meeting
- Question (Lorenzo): can this be integrated into TMVA?
  - Should be possible using a Python-in-C++ API
- Comment (Gilles): sklearn-compiledtrees handles the sklearn-to-C++ conversion, no need to do that step
- Question (Christina): Why is TMVA trying to re-implement all algorithms if it already integrates R and python?
  - Andreas: No need to do so any more. When TMVA started in 2005, there were lots of incoming methods and lots of comparisons that were not apples-to-apples
    - The real interest was to bring together all these methods in a framework which enforces fair comparisons and is easy to use for people who don't want to spend time on learning new techniques
    - Now ML is everywhere and new techniques are coming up, so good question: do we need to continue to do this?
    - Probably not; some people will still like the TMVA interface, but it is a good idea to have TMVA be able to work with other tools rather than re-implementing everything
  - Sergei: new interfaces will make this easier, trying to move toward having support for general external packages
  - Dan: the group could really help by providing information on how to convert between formats
    - IML should provide a page that compiles all of the useful tools/scripts with their intended use cases

Andrey: Classifier training and optimization - the reproducible way
- REP, the Reproducible Experiment Platform
  - Python-based, Jupyter-friendly
  - Unified API to many ML packages
  - Meta-algorithm pipelines = REP lego
  - Configurable interactive reporting and visualization
  - Pluggable quality metrics
  - Parallelized training of classifiers and grid search
  - Demo server: https://lhcb-rep.cern.ch, password "rep"
  - Github: https://github.com/yandex/rep
- Talk is full of useful examples and documentation for how to use REP
  - See slides for details, examples, and links
- Can train different models from different libraries in parallel (see the sketch at the end of these minutes)
- Example of using XGBoost as input to AdaBoost
- Multiple reporting/plotting options (matplotlib, ROOT, plotly, etc)
- Can define your own metrics
- Question (Sergei): Is there an example of everware?
  - Tim: Done in LHCb; in theory, if you had access to LHCb data, you could re-run the entire analysis
    - The people doing the analysis do a bit of work to make it available in everware
    - The analysis reviewer can then run the notebook and ensure all of the plots match what is provided by the group (ensuring reproducibility)
  - Sergei: What are the requirements for making everware work?
  - Tim: a git repository visible to the person who will clone it, and a Docker repository where it can run

AOB:
- Sergei: Would it make more sense to have an additional meeting for part 2 of the tools discussion in a couple of weeks, or to wait for the regular ~monthly schedule?
  - The room is evenly split; will discuss amongst the organizers and send out an announcement soon
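For reference, a minimal sketch of the skTMVA conversion step described in Iurii's talk above. The entry point convert_bdt_sklearn_tmva and its signature are taken from the koza4ok README and may differ between versions; the training data is a random placeholder:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# assumed entry point from https://github.com/yuraic/koza4ok; check the README
from skTMVA import convert_bdt_sklearn_tmva

# Placeholder training data: two variables, binary labels
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Train an AdaBoost BDT in scikit-learn
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=100)
bdt.fit(X, y)

# Write a TMVA-readable XML file; variables are declared as (name, type) pairs
convert_bdt_sklearn_tmva(bdt, [('var1', 'F'), ('var2', 'F')],
                         'bdt_sklearn_to_tmva.xml')
```

The resulting XML file is then read back with a standard TMVA reader in the experiment software, which is the point of the converter.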
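Similarly, a minimal sketch of the REP workflow from Andrey's talk: training classifiers from different libraries in parallel through one API, including the XGBoost-inside-AdaBoost example. The calls assume the API from the REP documentation; the dataset is a random placeholder:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from rep.estimators import SklearnClassifier, XGBoostClassifier
from rep.metaml import ClassifiersFactory

# Placeholder dataset: two features, binary labels
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(1000, 2)), columns=['var1', 'var2'])
y = (X['var1'] + X['var2'] > 0).astype(int)

factory = ClassifiersFactory()
factory.add_classifier('sklearn-gb', SklearnClassifier(GradientBoostingClassifier()))
factory.add_classifier('xgboost', XGBoostClassifier(n_estimators=100))
# XGBoost models used as the base estimator inside sklearn's AdaBoost
factory.add_classifier('ada-xgb', SklearnClassifier(
    AdaBoostClassifier(base_estimator=XGBoostClassifier(n_estimators=20),
                       n_estimators=5)))

# Train all classifiers in parallel (here: 4 local threads)
factory.fit(X, y, parallel_profile='threads-4')

# Uniform reporting across libraries, e.g. overlaid ROC curves
report = factory.test_on(X, y)
report.roc().plot()
```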