IML meeting: June 16, 2017

Peak people on Vidyo: 33
Peak people in the room: 28

Sergei: Intro and news
- CMS now has an ML forum, convened by Sergei and Maurizio Pierini, with a workshop in early July
- The ATLAS ML workshop happened two weeks ago
- New monthly LPCC data science seminar series
  - See the series category on indico: indico.cern.ch/category/9320/
- The ML Community White Paper (CWP) effort continues
  - Aiming towards the HSF Annecy workshop, June 26-30
  - Important effort for the future of ML in HEP
- New Marie Curie training network on statistics and ML in HEP, INSIGHTS
  - Mostly for PhD students and early-career researchers
  - Around 12 or 13 PhD students over around 4 years
  - Students will work on HEP applications, software, and tools
- MLHEP Summer School in Reading, UK (17-23 July)
- Fermilab ML meeting, July 14
- Next meeting will be July 13 in Salle Bohr (40-S2-B01)
  - Topic: trigger applications and community training
  - See indico for details: https://indico.cern.ch/event/638056/

Matthew Feickert: HEPML Resources - Knowledge Repository for HEP ML Work
- Started a github repository under the IML project: https://github.com/iml-wg/HEP-ML-Resources
- Goal: snapshot the ML work and knowledge of the HEP community
  - Current ML information, software, lectures and seminars, papers, workshops
  - Links are in the slides
- Built while learning about ML and discovering how much was already out there
- Intended to provide a centralized area, inspired by the "Awesome X" repositories
- If you think this is a good idea, please get involved
  - Right now the repository is focused on ATLAS, as that is the speaker's collaboration
  - It would be great to get more resources from other experiments to ensure good representation
  - Support is needed to keep it current and updated
- To contribute, take a look at the contributing document in the repository
  - There are several different ways to contribute, depending on time investment
- Question (Rob Fletcher): github has a wiki feature; in addition to just markdown it might be useful
  - Matthew: yes, that's an excellent idea
- Question (Sergei): great resource - it is also on us (authors of papers, etc.) to put our work there once it is public

Josh Bendavid: Use of Machine Learning Techniques for improved Monte Carlo Integration
- There is an enormous amount of detail and explanation in the slides; below is a small sampling, please see the slides for the full story
- Integration: given an arbitrary multidimensional function, find its integral
- Generation: given an arbitrary multidimensional function, generate an unweighted set of vectors distributed according to it as a probability density
- Typical algorithm: construct an appropriate sampling function, then generate a large number of events to evaluate the integral
- VEGAS: constructs a product of 1D histograms; quite simple, but non-trivial correlations introduce a hard limit on the achievable precision
- Foam: divides phase space into hyper-rectangles with optimized boundaries, so it can capture non-trivial correlations
  - A close analogue of simple decision trees
- Why not try boosted decision trees? They are known to work better than single trees
- Similar things can be done with deep neural networks, focusing on generative models
  - Several different approaches (generative adversarial networks, autoencoders, ...)
- For any given state of a generative network, if the input and output spaces have the same dimensionality, the probability density can be computed
- Introduce a function approximator, such as standard DNN regression, together with a weak iterative procedure
  - This addresses problems where the function and/or its derivatives are difficult or expensive to evaluate
- Comparisons of VEGAS vs Foam vs BDTs vs generative DNNs (slide 33)
  - ML does an excellent job, with minimal function evaluations
- 9D integration compared (slide 36)
  - The BDT does well but with more evaluations than VEGAS; the generative DNN scales much better with dimensionality
- Since ML is used for importance sampling, there is always an estimate of how good a job it is doing (a minimal sketch of such a weight-based diagnostic follows after this section)
  - May need to generate more or fewer events depending on the network precision
  - However, the final performance of the integration is controllable
- Question (Graeme): are the diagnostics for one specific network or for a group?
  - Josh: for one specific network
  - First train the network, then generate a large number of samples and compute the integration weight for each sample
  - From that, get the distribution on slide 37
  - Graeme: the comments made about the tail might depend on the particular DNN - might want to try different networks to see if it varies
  - Josh: agreed, fair point
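As an illustration of the weight-based diagnostic mentioned above, here is a minimal numpy sketch of importance-sampling integration with a learned sampling density; it is not Josh's code, and `target_density` and `sampler` (with its `sample`/`density` methods) are hypothetical placeholders for the integrand and for any trained generative model.

```python
import numpy as np

def importance_sampling_estimate(target_density, sampler, n_samples=100_000):
    """Estimate the integral of target_density using samples from a trained model.

    Sketch only: `sampler` stands in for a trained generative model that can draw
    samples and report the density q(x) it used to generate them.
    """
    x = sampler.sample(n_samples)      # points drawn from the learned density q(x)
    q = sampler.density(x)             # q(x) evaluated at each sample
    f = target_density(x)              # true integrand evaluated at each sample
    w = f / q                          # per-event integration weights

    integral = w.mean()                              # estimate of the integral
    stat_unc = w.std(ddof=1) / np.sqrt(n_samples)    # statistical uncertainty

    # The spread of the weight distribution is the diagnostic: a perfect match
    # between q(x) and f(x) gives constant weights, while long tails mean more
    # samples are needed to reach a given precision.
    effective_n = w.sum() ** 2 / (w ** 2).sum()
    return integral, stat_unc, effective_n
```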
Bob Stienen: SUSY-AI and the iDark project
- In BSM searches, no signs of new physics have been found, so we set limits
- It takes a long time to exclude regions of parameter space
  - Excluding a single model point can take hours
  - This can only be done fully if you're in one of the experimental collaborations
- SUSY-AI works in the 19-dimensional pMSSM space
  - 310324 model points with known exclusions are used as input data
  - Want to interpolate on this set of model points for points that were not generated
- Replacing the complex exclusion procedure with ML takes milliseconds rather than hours
- It does a very good job of representing the parameter space
  - Not perfect, not a 1-1 correspondence, but 93% accuracy at both 8 and 13 TeV
- Already good, but want to do better
  - What is the probability that, given my classifier output, the prediction is equal to the majority class in that bin?
  - Only believe the SUSY-AI prediction if it has an X% probability of being correct (a minimal sketch of this confidence-threshold idea follows after this section)
  - Horizontal lines on slide 8
  - If we insist on 95% confidence, we get 99% accuracy using 70% of the total data
  - If we go to 99% confidence, we get 99.7% accuracy on 50% of the total data
  - If you use SUSY-AI where it is 99.7% accurate and simulation for the other points, you cut the time in half and are certain of the result
- That is the current status; now looking forward
- The next SUSY-AI will not be pMSSM-exclusive anymore, it will support different model types
  - Users will specify the configuration
- Stacking will try running multiple classifiers on the same dataset at the same time
  - SUSY-AI currently runs on the combined result of 22 individual analyses and trains on the "Result" column (slide 14)
  - This reduces the amount of information from SUSY-AI: you lose the physical meaning of which analysis excludes a point
  - Now you can get information on which analysis says what
- Server-client option added
  - Speeds up the time to load classifiers for a given model point (avoids reloading for every operation)
- Given a single model point in parameter space, can we extract information from its vicinity?
  - "Boundary exploration"
  - This cannot easily be done with the ordinary workflow; maybe SUSY-AI can help with it
- Obtaining data is still a problem
  - Time consuming to generate, hard to make public
  - iDark will host a public database and plotting interface
  - idarksurvey.com for an online demo
- General summary
  - SUSY-AI is already fast and reliable, but is being further improved
  - The next version of SUSY-AI will be public in a few weeks
  - The lack of data will be addressed by iDarkSurvey
- Question (Steven): one can imagine expanding this to other areas, such as DM surveys - has this been considered?
  - Bob: absolutely, there is a use there in moving away from simplified models
  - Can create this multi-dimensional information within this algorithm
  - Currently aimed at SUSY and named SUSY-AI, but it generalizes to "AInalyses"
  - Anyone with data on any parameter space could in principle make a classifier that runs through this program
  - Yes, we want to generalize this to dark matter surveys and other problems
- Question (Sergei): did you study methods other than random forests?
  - Bob: tried everything in scikit-learn; random forests came out as the most reliable and fastest to train
  - We stay with this algorithm; it behaves correctly at energies higher than sampled, as the exclusion boundary doesn't change with energy
  - Sergei: probably because this was done early on, before neural networks were available in scikit-learn
  - Bob: yes, we want to try neural networks, it may be worthwhile to do so
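The following is a minimal scikit-learn sketch of the SUSY-AI idea described above (a random-forest classifier on model-point parameters, trusted only where its confidence exceeds a threshold); it is not SUSY-AI code. The toy data and labels are placeholders, and `predict_proba` is used as a stand-in for the calibrated per-bin probability of being correct described in the talk.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for the pMSSM dataset: X would be the 19 model parameters,
# y the known excluded (1) / allowed (0) label from the full recasting chain.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(10_000, 19))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Confidence threshold: only trust the classifier where its predicted class
# probability exceeds the chosen level; fall back to full simulation elsewhere.
threshold = 0.95
confidence = clf.predict_proba(X_test).max(axis=1)
trusted = confidence >= threshold

coverage = trusted.mean()   # fraction of points decided by ML rather than simulation
accuracy = (clf.predict(X_test)[trusted] == y_test[trusted]).mean()
print(f"coverage: {coverage:.2%}, accuracy on trusted points: {accuracy:.2%}")
```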
- "Boundary exploration" - Cannot easily be done with the ordinary workflow, maybe SUSY-AI can help with this - Obtaining data is still a problem - Time consuming to generate, hard to make it public - iDark will host public database and plotting interface - idarksurvey.com for online demo - General summary - SUSY-AI already fast and reliable, but being further improved - Next version of SUSY-AI will be public in a few weeks - Lack of data will be addressed by iDarkSurvey - Question (Steven): can imagine expanding this to other areas, such as DM surveys, has this been considered? - Bob: Absolutely, is a use there in moving away from simplified models - Can create this multi-dim information within this algorithm - Currently aimed at SUSY, named SUSY-AI, but generalizes to AInalyses - Anyone with data on any parameter space could in principle make a classifier that runs through this program - Yes, we want to generalize this to dark matter surveys and other problems - Question (Sergei): did you study other methods than random forests - Bob: tried everything in scikit-learn, they came out to be the most reliable and fastest training - Remain with this algorithm, will behave correctly at energies higher than sampled as exclusion boundary doesn't change with energy - Sergei: probably since did it early on, done before NN, as weren't in scikit-learn yet - Bob: yeah, want to try with neural networks, may be worthwhile to do so Zahari Kassabov: Learning parton densities with neural networks - The NNPDF Methodology - Functions f_i(x,M_x^2) need to be learned from data - PDF of parton i carrying a fraction of momentum x at scale M_x - NNPDF 3.1 NNLO produced last month, public arxiv:1706.00428 - We want to both determine the PDFs and obtain a sensible estimate of their uncertainty - Uncertainties on input experimental data - Degenerate minima (+inefficiencies on the minimization) - Theory uncertainties (value of alpha_s, etc) - Not a well researched topic in ML - Constraints come from convolutions, and not so much data, only 4285 data points - Not really "big data", but still a complicated problem - 7 physical processes from 14 experiments over ~30 years - Compared to standard ML problem - Require statistically sound uncertainty estimate - Problem is regression but available data has complex dependence on PDFs - There are some physical constraints - NNPDF approach - Since we don't have constraints, we should have a very general model - Use a neural network, fully connected, two sigmoid hidden layers, one linear layer - 296 network parameters - Propagate experimental uncertainties by doing many fits with different fluctuations of data - Experiments give us covariance matrix - We sample data from experiments according to the covariance matrix - Statistics of PDFs calculated from the ensemble of PDF replicas - Fit PDFs using genetic algorithms, would really like to improve this - At each iteration, generate 80 mutants and select best mutant - Very easy to implement and understand, good dealing with complex analytic behaviour, doesn't require computing gradient - May not be close to global minimum, requires many function evaluations (convolutions), needs tuning - Closure tests - Assume the underlying PDF is known - Generate data, fluctuating around the prediction of the true PDF - Perform a fit and compare with assumed PDF - Check that the results are consistent - Various levels of closure test (slide 22) - No questions Daniel Krefl: ML of CY volumes - Experimental Mathematics - Even in formal theory or 
Daniel Krefl: ML of Calabi-Yau (CY) volumes
- Experimental mathematics: even in formal theory and mathematics, we are starting to add ML to our toolbox
- The question: is the minimum volume a function of the geometry?
  - Calculate Vmin numerically (complex but possible)
  - Use ML to investigate the function
- Can generate an effectively infinite train and test set
  - Generated ~10k data points with a 75/25 train/test split
  - Drawn from 100k data points, but related diagrams are removed to ensure train and test sets are distinct
- First ansatz: linear regression
- Second ansatz: using CNNs, "deep and wide" nets (a minimal sketch of such a two-branch model follows at the end of these minutes)
- Question (Michele): why linear regression?
  - Daniel: to stabilize the tails; the problem is learning the tails of this distribution
  - Michele: don't you also lose some information?
  - Daniel: yes, but here we need the tails under control and can sacrifice some information from the center to achieve that
- Found that the minimum volume can be approximated from topological data via ML models
- The concept can be applied to many other conjectured relations
  - "Experimental mathematics"
  - A discovery engine to find new relations (via statistical evidence)
  - Which should subsequently be made rigorous
  - Or maybe not: would an AI do mathematics in an approximate, data-driven way?
- Question (Graeme): the inputs you had at the top and bottom - are they the same inputs?
  - Daniel: yes, it looks like an autoencoder, but it's not
  - Both branches are fitted simultaneously, and they take the same inputs
  - Michele: how did you merge your branches?
  - Daniel: just concatenate them
  - Sergei: we're moving towards more automatic features
  - Daniel: this tries to combine both hand-crafted and automatic features
- Question (Paul): you will give a more in-depth talk next week - at which event?
  - Daniel: the String Theory colloquium on Tuesday
  - It will go into more detail on the theory side rather than the ML
  - https://indico.cern.ch/event/646694/
- Question (Sergei): which tools?
  - Daniel: standard tools, python and Keras based

Long-Gang Pang: EoS-meter of QCD transition from deep learning
- Postponed to a future meeting due to Vidyo issues
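Referring back to Daniel Krefl's "deep and wide" network that feeds the same inputs to two branches and merges them by concatenation, here is a minimal Keras sketch; the layer sizes, activations, and dense (rather than convolutional) deep branch are illustrative assumptions, not taken from the talk.

```python
from keras.layers import Input, Dense, Concatenate
from keras.models import Model

# Minimal sketch of a "deep and wide" two-branch regression model: a wide
# (essentially linear) branch and a deep branch on the same inputs, merged by
# concatenation as mentioned in the talk. All sizes are placeholder assumptions.

n_features = 10  # placeholder for the number of topological input features

inputs = Input(shape=(n_features,))

# Wide branch: a single linear transformation of the inputs
wide = Dense(1, activation="linear")(inputs)

# Deep branch: a small fully connected stack on the same inputs
deep = Dense(64, activation="relu")(inputs)
deep = Dense(64, activation="relu")(deep)
deep = Dense(1, activation="linear")(deep)

# Merge the branches and regress the target (here, the minimum volume)
merged = Concatenate()([wide, deep])
output = Dense(1, activation="linear")(merged)

model = Model(inputs=inputs, outputs=output)
model.compile(optimizer="adam", loss="mse")
model.summary()
```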