IML meeting: February 3, 2016
Peak people on Vidyo: 23
Peak people in the room: 21

Tim: Intro and news
- Likely another follow-up meeting for more tools contributions in a couple of weeks' time, in the afternoon
  - Trying to find a good compromise for people in different time zones and conflicting afternoon commitments
- Multiple upcoming ML in HEP workshops (two later this month!)
- Actual ML conferences are also coming, but deadlines for submitting abstracts/papers are in the coming weeks

Sergei/Omar: New and upcoming features in TMVA
- Several new features are available, many more are in progress
- Six months ago, came up with a document on the desired future of TMVA - making good progress on these items
  - Slide 2 summarizes what's done, what's in progress, and what's not yet underway
- TMVA, via the new PyMVA and RMVA interfaces, can now link to many external ML packages (xgboost, scikit-learn, etc) - see the sketch after this section
- New DataLoader class allows greater flexibility (currently TTrees, but could be extended to data frames etc)
- Deep learning has been added, currently under final testing/validation
- SVM will be discussed in the following talk by Tom
- Major internal updates are currently in progress
  - Removal of static variables to support parallelization
  - Creation of a lightweight constructor which doesn't save outputs to the ROOT file
  - Separation of the classification/regression classes
- DataLoader is planned to be extended to support .csv, HDF5, JSON, etc
  - If there is a useful format missing from their list (slide 14), please contact them
- Work on integrating with ROOTBooks and the Jupyter platform
- TMVAGui is being updated, e.g. with the ability to visualize multiple datasets
- New TMVA::CrossValidation class
  - Will support parallel execution
  - Optional hyperparameter tuning
- Additional deep learning plugins coming, such as darch to include GPU support
- Parallelization is a major task, with multiple levels of parallelization under discussion
- Memory usage is being revisited, especially for parallel execution or multiple datasets
- No questions
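As a concrete illustration of the new interfaces, here is a minimal sketch of booking a scikit-learn method through PyMVA with the new DataLoader. This assumes a ROOT build with PyMVA enabled; the input file and tree names ("inputs.root", "sig", "bkg") are placeholders, and the booking options follow the ROOT PyMVA tutorials rather than anything shown in the talk:

```python
import ROOT
from ROOT import TMVA

TMVA.Tools.Instance()
TMVA.PyMethodBase.PyInitialize()  # required before booking any PyMVA method

data = ROOT.TFile.Open("inputs.root")  # placeholder input file/tree names
sig_tree, bkg_tree = data.Get("sig"), data.Get("bkg")

out = ROOT.TFile("TMVA_PyMVA.root", "RECREATE")
factory = TMVA.Factory("TMVAClassification", out,
                       "!V:AnalysisType=Classification")

# The new DataLoader decouples data handling from the Factory
loader = TMVA.DataLoader("dataset")
loader.AddVariable("var1", "F")
loader.AddVariable("var2", "F")
loader.AddSignalTree(sig_tree)
loader.AddBackgroundTree(bkg_tree)
loader.PrepareTrainingAndTestTree(ROOT.TCut(""), "SplitMode=Random:!V")

# Book scikit-learn's GradientBoostingClassifier via the PyMVA interface
factory.BookMethod(loader, TMVA.Types.kPyGTB, "PyGTB", "!V:NEstimators=100")

factory.TrainAllMethods()
factory.TestAllMethods()
factory.EvaluateAllMethods()
out.Close()
```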
Tom: Development of support vector machines
- Provides a nice introduction to SVMs
  - Soft margin SVM relaxes the hard requirement: points can be on the wrong side, with tunable costs
  - Kernel function choices discussed (standard is the inner product)
  - Please see the slides for the details
- SVMs already existed in TMVA; additional functionality has been added to make them easier to use
  - More kernel functions
  - Hyperparameter optimisation (kernel parameters and cost)
  - Cost weighted to the signal/background dataset size ratio
  - Loss functions implemented but not yet used
- When using SVMs, always use the option "VarTransform=Norm" to properly handle variables of different scales (e.g. pT vs eta)! A minimal booking sketch is given after this section
- Other options have been added/expanded with new features (see slides 11 and 12 for details)
- Example on a checkerboard dataset demonstrates out-of-the-box SVM performance (very similar to BDT for this simple dataset)
- Potential future developments:
  - More documentation
  - Performance contours in parameter space
  - More kernels
  - Optimisations/improvements to the algorithm
- Question (Sergei): SVMs right now are a huge memory hog - 150k events takes 30 GB of memory
  - Reduced SVMs are claimed to improve this - have you tried this?
  - Tom: Not tried yet, but definitely interested in this
- Question (Andreas): the checkerboard example is highly tuned to SVM performance
  - Would be great to follow up with more HEP-like examples with more non-linear correlations
  - Also would be good to look into comparing the current TMVA SVM with libSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/)
    - Use it as a benchmark to see if performance is similar
  - Lorenzo: This library is actually accessible from TMVA now via the RMVA interface, so a comparison should be straightforward
- Question (Pietro): in the current implementation, can the SVM be used with bagging techniques?
  - Andreas: Yes, it should work, but it hasn't been used much so far
- Question (Andreas): the SVM depends crucially on the cost parameter
  - Tom: yes, huge dependence here
    - Out of the box, the cost might be too strict (so training doesn't complete properly) or too loose (allowing too many points on the wrong side)
    - Definitely one of the main things that needs to be optimized - part of the Optimize method to address this
  - Andreas: would be good to make this part of the default SVM method in TMVA rather than only via an option
- Comment: Good to compare with SVM-hint, to be presented in part 2 of this meeting (another day)
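Here is a minimal sketch of booking the TMVA SVM with the recommended "VarTransform=Norm", again assuming the DataLoader-based API; the kernel and cost settings (Gamma, C) are illustrative placeholders to be tuned, not recommendations from the talk, and the input file and tree names are again placeholders:

```python
import ROOT
from ROOT import TMVA

TMVA.Tools.Instance()

out = ROOT.TFile("TMVA_SVM.root", "RECREATE")
factory = TMVA.Factory("TMVAClassification", out,
                       "!V:AnalysisType=Classification")

loader = TMVA.DataLoader("dataset")
loader.AddVariable("pt", "F")   # variables with very different scales...
loader.AddVariable("eta", "F")  # ...are exactly why VarTransform=Norm matters
data = ROOT.TFile.Open("inputs.root")  # placeholder input file
loader.AddSignalTree(data.Get("sig"))
loader.AddBackgroundTree(data.Get("bkg"))
loader.PrepareTrainingAndTestTree(ROOT.TCut(""), "SplitMode=Random:!V")

# Gamma (RBF kernel width) and C (cost) are illustrative; the cost in
# particular should be tuned, e.g. via the Optimize machinery mentioned above
factory.BookMethod(loader, TMVA.Types.kSVM, "SVM",
                   "Gamma=0.25:C=1.0:Tol=0.001:VarTransform=Norm")

factory.TrainAllMethods()
factory.TestAllMethods()
factory.EvaluateAllMethods()
out.Close()
```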
Iurii: Scikit-learn to TMVA - XML converter tool
- Found that the ROC curve for the BDT was below the standard cuts for both electron and muon results
- Decided to try a different MVA library: scikit-learn
  - The same approach now has ROC curves above the standard-cut performance
  - May be down to how they are using TMVA, maybe not, but happy with the scikit-learn results
- However, no sklearn is available in the ATLAS software
  - Solution: convert the classifier from sklearn to the XML format readable by TMVA
- Made the skTMVA package to convert sklearn outputs to TMVA XML files
  - Available on github: https://github.com/yuraic/koza4ok
  - Currently supports BDT binary classification, with AdaBoost and Gradient Boosting
  - Slide 10 gives a code usage example (see also the sketch at the end of these minutes)
  - Example demonstrates that sklearn and skTMVA have identical performance (model correctly transcribed)
- Future plan: convert a scikit-learn model to a standalone C++ file
  - Thinking of doing so in a TMVA-like way
- Question (Sergei): any plans to expand this to other input formats converted to TMVA?
  - Would be nice to have this, and if the community would support it, more likely to work on this
  - Not the primary task, this is on the side, but would like to contribute if there is demand
- Comment (Sergei): the TMVA problems were likely either the tool being used wrong, or a problem with the ROC calculation code
  - A new ROC calculation class has been added to TMVA to fix this
  - Also noticed that how the ROC is calculated in sklearn and TMVA differs
  - Would be interesting to see a direct, fair comparison
- Comment (Dan): technical issue - the current TMVA format couldn't handle the desired deep neural network format
  - Ended up writing their own class
  - Would be good to have TMVA come up with a way to provide enough flexibility that external packages can be used for training and model building, but evaluation is done in TMVA
  - Sergei: The file format is changing, which will help, but they are thinking about deeper changes
  - Marie: The problems we encountered are documented in a previous meeting
- Question (Lorenzo): can this be integrated into TMVA?
  - Should be possible using a Python-in-C++ API
- Comment (Gilles): sklearn-compiledtrees handles the sklearn-to-C++ conversion, no need to do that step
- Question (Christina): Why is TMVA trying to re-implement all algorithms if it already integrates R and python?
  - Andreas: No need to do so any more. When TMVA started in 2005, there were lots of incoming methods and lots of comparisons that were not apples-to-apples
    - The real interest was to bring together all these methods in a framework which enforces fair comparisons and is easy to use for people who don't want to spend time on learning new techniques
    - Now ML is everywhere and new techniques are coming up, so good question: do we need to continue to do this?
    - Probably not; some people will still like the TMVA interface, but it is a good idea to have TMVA be able to work with other tools rather than re-implementing everything
  - Sergei: new interfaces will make this easier, trying to move toward having support for general external packages
  - Dan: the group could really help by providing information on how to convert between formats
    - IML should provide a page that compiles all of the useful tools/scripts with their intended use cases

Andrey: Classifier training and optimization - the reproducible way
- REP, the Reproducible Experiment Platform
  - Python-based, Jupyter-friendly
  - Unified API to many ML packages
  - Meta-algorithm pipelines = REP lego
  - Configurable interactive reporting and visualization
  - Pluggable quality metrics
  - Parallelized training of classifiers and grid search
  - Demo server: https://lhcb-rep.cern.ch, password "rep"
  - Github: https://github.com/yandex/rep
- Talk is full of useful examples and documentation for how to use REP
  - See slides for details, examples, and links
- Can train different models from different libraries in parallel (see the sketch at the end of these minutes)
- Example of using XGBoost as input to AdaBoost
- Multiple reporting/plotting options (matplotlib, ROOT, plotly, etc)
- Can define your own metrics
- Question (Sergei): Is there an example of everware?
  - Tim: Done in LHCb; in theory, if you had access to LHCb data, you could re-run the entire analysis
    - The people doing the analysis do a bit of work to make it available in everware
    - The analysis reviewer can then run the notebook and ensure all of the plots match what is provided by the group (ensuring reproducibility)
  - Sergei: What are the requirements for making everware work?
  - Tim: a git repository visible to the person who will clone it, and a Docker repository where it can run

AOB:
- Sergei: Would it make more sense to have an additional meeting for part 2 of the tools discussion in a couple of weeks, or to wait for the regular ~monthly schedule?
  - The room is evenly split; will discuss amongst the organizers and send out an announcement soon
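For reference, a minimal sketch of the skTMVA conversion step described in Iurii's talk above. The entry point convert_bdt_sklearn_tmva and its signature are taken from the koza4ok README and may differ between versions; the training data is a random placeholder:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# assumed entry point from https://github.com/yuraic/koza4ok; check the README
from skTMVA import convert_bdt_sklearn_tmva

# Placeholder training data: two variables, binary labels
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Train an AdaBoost BDT in scikit-learn
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=100)
bdt.fit(X, y)

# Write a TMVA-readable XML file; variables are declared as (name, type) pairs
convert_bdt_sklearn_tmva(bdt, [('var1', 'F'), ('var2', 'F')],
                         'bdt_sklearn_to_tmva.xml')
```

The resulting XML file is then read back with a standard TMVA reader in the experiment software, which is the point of the converter.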
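Similarly, a minimal sketch of the REP workflow from Andrey's talk: training classifiers from different libraries in parallel through one API, including the XGBoost-inside-AdaBoost example. The calls assume the API from the REP documentation; the dataset is a random placeholder:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from rep.estimators import SklearnClassifier, XGBoostClassifier
from rep.metaml import ClassifiersFactory

# Placeholder dataset: two features, binary labels
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(1000, 2)), columns=['var1', 'var2'])
y = (X['var1'] + X['var2'] > 0).astype(int)

factory = ClassifiersFactory()
factory.add_classifier('sklearn-gb', SklearnClassifier(GradientBoostingClassifier()))
factory.add_classifier('xgboost', XGBoostClassifier(n_estimators=100))
# XGBoost models used as the base estimator inside sklearn's AdaBoost
factory.add_classifier('ada-xgb', SklearnClassifier(
    AdaBoostClassifier(base_estimator=XGBoostClassifier(n_estimators=20),
                       n_estimators=5)))

# Train all classifiers in parallel (here: 4 local threads)
factory.fit(X, y, parallel_profile='threads-4')

# Uniform reporting across libraries, e.g. overlaid ROC curves
report = factory.test_on(X, y)
report.roc().plot()
```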