IML meeting: April 14, 2016
Peak people on Vidyo: 39
Peak people in the room: 27

Michele: Intro and news
- The next meeting will take place on May 17 at 15:00 in the main auditorium
- The focus will be on regression
- See the slides for information on other upcoming events and software news

Harrison: Bayesian neural networks and general remarks
- Need to consider both cost and benefit to determine usefulness
  - A 10% benefit at the price of a huge increase in computing time is not really useful
- Very informative and detailed slides; please look at them rather than these minutes
- Question, Sergei: are you using shallow BNNs? How many hyperparameters compared to other NNs?
  - Harrison: yes, and a comparable number of hyperparameters
- Question, Steven: you said there is "no free lunch"; what is your opinion on AutoML and similar?
  - Harrison: there may be some methods that work in some circumstances
  - There is no uniformly most powerful method
  - Most important is to figure out how to build something practical, and to put time into things that work
  - Focus on methods that have ways to actually define error bars; otherwise they are not useful

Michela: Introduction to NNs using Keras
- Keras is a modular and user-friendly library for deep learning, built on Theano and TensorFlow
- Very well documented, with lots of working examples
- Neural networks = a stack of tensor operations
- Sequential model: a linear stack of layers
  - Easy in Keras; code example in the slides
- Graph model: multi-input, multi-output, arbitrary internal connections
  - Also possible; example not in the slides
- Dense layers
  - The core unit of a multi-layer perceptron (MLP)
  - An affine transformation mapping the input to an output, u = Wx + b
  - All entries in W and b are trainable
  - Only need to specify the number of inputs and outputs for a basic dense layer (only the number of outputs after the first layer)
- Question, Vidyo: what about the output layer for multi-output?
  - Michela: can write Dense(1) for one output, Dense(N) for an N-class output problem
  - Each class has a specific label; the multi-dimensional output gives the probability of an event belonging to class 0, 1, etc.
- Activation function
  - Can be interpreted as a layer itself or attached to a dense layer (both are possible in Keras)
  - Quantifies the activation state of a node, i.e. whether it is firing or not
  - Non-linear activation functions are key to deep learning
  - Some examples of different activation functions are given in the slides
  - A sigmoid is usually used for the final layer (gives a value between 0 and 1, a probability)
  - However, it is not good for intermediate layers; use others instead
- Weight initialization
  - Before training, the weights have to begin with some value
  - Initial values should be chosen such that training can converge quickly
  - Lots of different initialization strategies are available in Keras, listed in the slides
- Forward propagation
  - Transforms the input vector of features through the layers to obtain the final output
  - Depends both on the input vector and the current weight values in each layer
- Loss function
  - Dictates how strongly we penalize different types of mistakes: the cost of inaccurately classifying an event
  - Used to evaluate the performance of the NN (can't use the area under the curve here)
  - Common loss functions are listed in the slides, but the choice is very problem dependent; you need to know your problem
- Optimizers
  - Training is just an optimization problem; the optimizer controls which method is used to converge
  - Stochastic gradient descent is the most common, but others exist and are supported
- Back propagation
  - The hard part of ML: derivatives through the whole system
  - The loss must be differentiable with respect to every parameter
  - Modern DL libraries like Keras use tensor math libraries that handle all of this in a very optimized way
  - These can compile CUDA code to run on the GPU (or machine instructions for the CPU)
- Had to skip the next couple of slides for time
  - Training a very simple NN is shown on one slide
  - A nice visualization tool is linked (playground.tensorflow.org)
- Question, Sergei: details on the visualization?
  - Michela: you can use whatever you like, not just TensorFlow
  - It shows graphically what happens as hyperparameters and similar are changed
- Question, Sergei: does Keras have cross-validation implemented?
  - Gilles: no cross-validation in Keras, but you can use scikit-learn for this
- Question, Sergei: support for Keras? It was originally one person
  - Michela: there is a new version; I haven't used it too much yet
  - New graphical interface; most examples work on either release interchangeably
  - Also, it started with one person, but now hundreds of people contribute
  - Growing community, very well documented, well supported, and you can contribute too!
  - They try to keep it as general as possible, so no super-specific layers, but you can still add them yourself on a branch
  - Also, Keras has both a Python and a C++ API (the TensorFlow C++ API)
  - So it is very useful to train in Python and then export to C++

Paul: Deep neural networks with domain adaptation
- Common problem: a classifier works great in simulation, but performance degrades significantly in real data
- Should we just make the MC better?
  - Time and work spent improving the simulation will of course help
  - However, how good does it need to be? Can you make it perfect?
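As an aside to Michela's talk: the forward propagation through Dense layers described above can be illustrated without any library. This is a minimal plain-Python sketch of u = Wx + b plus activations; the layer sizes and random weights are arbitrary illustrations, standing in for Keras's initialization strategies:

```python
import math
import random

def dense(x, W, b):
    """Affine map u = Wx + b: the core of a Dense layer (all entries of W, b trainable)."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def sigmoid(u):
    """Squash each value into (0, 1); suitable as a final-layer probability."""
    return [1.0 / (1.0 + math.exp(-ui)) for ui in u]

random.seed(0)
# Tiny MLP: 3 inputs -> 4 hidden nodes (ReLU) -> 1 output (sigmoid); sizes are made up
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
b1 = [0.0] * 4
W2 = [[random.uniform(-1, 1) for _ in range(4)]]
b2 = [0.0]

x = [0.5, -1.2, 0.3]
hidden = [max(0.0, u) for u in dense(x, W1, b1)]  # ReLU for the intermediate layer
prob = sigmoid(dense(hidden, W2, b2))[0]          # final sigmoid gives a probability
print(prob)
```

In Keras the same network would be a Sequential model of two Dense layers; training then amounts to back-propagating a loss through these same operations.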
- Outside of particle physics, there are also differences between product photos, DSLR photos, webcam photos, etc.
- Ideally, we want to train a classifier that distinguishes between classes free of noise and oblivious to the "domain" (data/MC, DSLR/webcam)
- Idea: train an additional network output to distinguish data from MC
  - If the domain classifier cannot distinguish data from MC, then the features are independent of the domain
  - Training now needs three samples: signal MC, background MC, mixed data
- Got the authors' code from GitHub for Caffe
- Used data similar to the Kaggle challenge
- Forced the network to use variables which may be problematic between data and MC
- The domain loss factor is reduced by this approach
- TMVA and the Caffe DNN without the gradient reversal layer (GRL) are at the same level
  - The signal/background separation doesn't really improve
- So far has not found a training which could really take data/MC differences into account
- The Caffe network was not using optimized variables while TMVA was, so maybe benefits are possible
  - However, that was not the point of this exercise
- Question, Sergei: training with both MC and data in the mix, adding the domain classifier and continuing training: are you pausing and then resuming?
  - Paul: yes, that's about right
  - Technically a new network is created, but all the weights are copied from the already trained network to the new one
- Question, Vidyo: missed due to room microphone problems
- Comment, Kyle: you don't have the minimax scenario common to this kind of adversarial training
  - Could be part of the reason you don't see large gains
  - Gilles: in this paper, they still look for a saddle point, not a global minimum

Dan: Jet substructure classification with deep neural networks
- Some EW particle decays to two prongs; need to distinguish this from QCD
- Many "engineered" substructure variables accounting for things like the number of subjets, mass, and energy distributions
- If we're going to use ML, do we need engineered variables?
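The gradient reversal layer (GRL) at the heart of Paul's domain-adaptation setup can be sketched in a framework-free way: the layer is the identity in the forward pass, but flips and scales the gradient coming from the domain classifier in the backward pass, so the shared feature layers learn to confuse the data-vs-MC discrimination. The class name and lam value below are illustrative; real implementations hook into the autodiff of Caffe/TensorFlow:

```python
class GradientReversal:
    """Sketch of a gradient reversal layer: identity forward, -lambda * grad backward."""

    def __init__(self, lam):
        self.lam = lam  # trade-off between the class loss and the domain loss

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad):
        # Reversed, scaled gradient flows back into the shared feature layers,
        # pushing them to *worsen* the domain (data vs MC) classifier
        return [-self.lam * g for g in grad]

grl = GradientReversal(lam=0.5)
assert grl.forward([0.3, -1.5]) == [0.3, -1.5]
print(grl.backward([1.0, -2.0]))  # [-0.5, 1.0]
```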
- Jet images: a jet is just an image in the calorimeter, 32x32 = 1024 pixels
  - Can then use standard image recognition techniques
- Tried both a BDT (scikit-learn) with engineered variables and deep learning (Keras) with jet images, comparing which was better
  - Approximately 750k free parameters in both cases
- Deep networks outperform BDTs on engineered variables by a small amount
  - Both are much better than cuts on engineered variables
- From checks, the NNs are learning the same higher-level features, especially the mass
  - The theorists have done a very good job of creating high-level variables that contain almost all the relevant information
- Would be interesting to extend to 3D, as calorimeters are 3D images
- Question, Rosa: you say the NN has learned about the same as the BDT, but have you given the NN enough to surpass the BDT?
  - Dan: all simulation based, 10 million jets
  - Couldn't see any large differences with more statistics
- Question, Sergei: what about time? How long did these take?
  - Dan: on the order of a day to train the NN on a cluster with GPUs and similar
  - The BDTs took about the same amount of time (no GPU)
  - But could probably get quite a bit of this performance out of much smaller architectures
- Question, Sergei: are these feed-forward networks?
  - Dan: simple feed-forward neural networks
  - Three or four locally-connected layers, then fully connected dense layers
- Question, Steven: have you looked at higher pT? Most of the substructure variables work great in this semi-resolved regime, but not at high pT when calorimeter granularity becomes a limitation
  - Dan: have not looked at it, but a good idea for the future
- Question, Mike: have you tried transverse energy instead of energy? Energy is not boost invariant
  - Dan: another good thing to look at

Adam: Deep learning at NOvA
- NOvA is a long-baseline neutrino oscillation experiment
- At NOvA, data is taken in 550-microsecond intervals
- Handscans are done to find neutrinos; mostly it's just cosmics/background
- Clustering in space and time creates slices
  - Helps to separate neutrinos, cosmic rays, and noise
- Traditional neural nets are great, but have deficiencies
  - They don't scale very well to raw data
  - It is important to reduce raw data down to a few powerful, well-engineered features
  - They fail to generalize to other data
- Multi-layer networks can approximate a function with fewer nodes than a single layer, so a good way to go
- Rectified linear units (ReLU) have really made all of this possible
- Convolutional neural nets are one powerful option
  - Many common kernels exist, but we want to learn optimal kernels from the data
  - Convolutional layers have a small number of weights shared across the image
- Tried several different models by different computer vision groups for image classification contests
  - LeNet, AlexNet, VGG, and GoogLeNet
  - GoogLeNet performed the best and was faster
- Converted inputs to images, only did hit clustering (no other reco)
  - Took relatively large boxes (80x100 pixels), 18.5 m deep and 4 m wide
- Created the Convolutional Visual Network (CVN), very strongly based on GoogLeNet
  - Two shorter versions of GoogLeNet running in parallel
  - Used several methods to avoid overtraining, detailed in the slides
- νe: obtained significant improvements (40%) in efficiency for the same purity
  - Primarily due to improved efficiency in resonances and deep inelastic scattering
- νμ: minimal improvement as there is not much more to learn, but still a bit better, so not underperforming
- GPU acceleration can train this in about one week; on CPU it would take much longer
- Question, Dan: the first step of processing was to project into two dimensions; why not 3D?
  - Adam: the data comes in 2D views.
  - You can construct 3D representations, but it takes some reconstruction to do that
  - Trying to think of ways to sidestep these reconstruction issues by using the 2D data
  - Detector cells are stacked either horizontally or vertically, not both
  - The raw data is inherently two dimensional
  - Planes alternate, so it can be done, but the data is intrinsically 2D
- Question, Kyle: what is the reaction in the collaboration to this? Loving/skeptical/a mix?
  - Adam: interesting from a sociological point of view
  - This is six months of work
  - At first, uniform skepticism
  - With the first results (40% improvement), even more skepticism
  - After running the network on a large number of systematically shifted samples, the gains were not illusory; the result was robust
  - Data/MC disagreements in the unblinded near detector were about the same as before
  - Now, no more detractors; full support from the collaboration
  - Original detractors: "can't argue with results"
- Question, Tim: for something that takes a week to train, how did you get started? Did you find a way to learn the mistakes you were going to make in less time?
  - Adam: yes, the week was for the final training of the production version
  - Had a good idea of heading in the right direction within half a day
  - Would set it up running, submit to nodes interactively, and watch the training and testing losses and testing accuracies
  - It is usually apparent fairly quickly when trends are not going in a good direction

Amir: Deep learning reconstruction, LArTPC to ATLAS calorimeter
- Why do we want to go deep?
  - Better algorithms: hopefully outperform hand-crafted algorithms; unsupervised learning (anomaly detection, etc.)
  - Faster algorithms: can determine precision without going through all the combinatorics; already parallelized
  - Easier development: feature learning vs feature engineering saves on development time and costs
- Many problems in the HEP software landscape
  - Complexity, future architectures, expensive computation, and development and maintenance
  - Not much support is provided for software R&D
- LBNF/DUNE will run for a long time; a major project
- Liquid Argon Time Projection Chambers (LArTPCs)
  - Fundamentally read out two dimensions, but can be projected into 3D with some care
  - Goal of 80% neutrino efficiency; the current generation achieves roughly half of this
- Most neutrino physics: need to know the flavour of the neutrino and its energy
- User-assisted reconstruction is common for LArTPC reconstruction
  - Small datasets, so automation was not needed in the past
  - Fully automatic reconstruction is now coming together
- Did a feasibility study with GoogLeNet on 224x224 pixel images
  - 90% electron efficiency for a 2% fake rate with no real effort, blindly throwing in data without optimization
  - This is better than has been seen before, so a very good start
  - Have also studied custom DNNs of various types
- Towards reconstruction with DNNs, starting with calorimeters
  - Essentially a 3D image, very similar to the LArTPC work
  - If we improve the ID and energy resolution, we can make peaks stand out
  - Can also turn this into a generative model for fast simulation, saving lots of CPU
  - Lots of potential uses in other places too, not just calorimeters
- Can consider putting all 200k cells of the calorimeter as an input to the DNN
  - Need very large samples with full simulation, not in standard datasets
- Open effort for general calorimetry studies, done outside of ATLAS (aimed at general LHC/ILC); all are welcome to join
- Question, Steven: how do you handle the different granularity of different parts of the calorimeter?
  - Amir: throw it into a network and see if it can figure it out
  - Have thought about how to use a convolutional neural network
  - Maybe a 2D CNN on each layer, so each then has the same level of granularity
  - Stack the feature maps on top of each other
  - The different layers have to be somehow coupled together to represent the same feature
- Question, Sergei: calorimeters of today and calorimeters of tomorrow may not be the same
  - It would be interesting to see some R&D study with these standard calo designs, some general wider future calorimeter
  - Amir: completely agree; a good compromise is a design already in Geant4
  - So ATLAS is a good place to start
  - The challenge is to be able to do this with the full community rather than just within an experiment
  - Need to be able to have a general discussion across experiments, not bound by one
- Question, Sergei: challenges on calorimetry and DNNs?
  - Dan: if possible, better to just have public data etc. and get people to use the data and write papers
  - Amir: completely agree; people will over-optimize for that particular challenge
  - Kyle: several people signed onto the idea of trying to make public simulators available for multiple uses
  - There was a talk at NIPS criticizing challenges, and proposing variations that have a similar feel but avoid some of the traps that have been encountered
  - Lots of the work for future data science in HEP is along the path of making public simulators
  - Sergei: would really also want people within the collaborations working on this, not just external people, as you mostly have in challenges
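Amir's suggestion above (process each calorimeter layer in 2D to bring everything to a common granularity, then stack the resulting maps) can be sketched in plain Python; the 8x8 and 4x4 layer sizes and the use of average pooling are invented for illustration, not taken from any experiment:

```python
def avg_pool(grid, factor):
    """Average-pool a square grid by an integer factor to coarsen its granularity."""
    n = len(grid) // factor
    return [[sum(grid[i * factor + di][j * factor + dj]
                 for di in range(factor) for dj in range(factor)) / factor**2
             for j in range(n)]
            for i in range(n)]

fine_layer = [[1.0] * 8 for _ in range(8)]    # finely segmented layer (8x8 cells)
coarse_layer = [[2.0] * 4 for _ in range(4)]  # coarser layer (4x4 cells)

# Bring both layers to the coarser 4x4 granularity, then stack them as
# channels of one "image" that a CNN could consume
stacked = [avg_pool(fine_layer, 2), coarse_layer]
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # 2 4 4
```

In a real network the per-layer step would be a learned 2D convolution rather than fixed pooling, so the coupling between layers is itself trainable.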