IML meeting: April 14, 2016
Peak people on Vidyo: 39
Peak people in the room: 27

Michele: Intro and news
- The next meeting will take place on May 17 at 15:00 in the main auditorium
- The focus will be on regression
- See the slides for information on other upcoming events and software news

Harrison: Bayesian neural networks and general remarks
- Need to consider both cost and benefit to determine usefulness
  - A 10% benefit at the price of a huge increase in computing time is not really useful
- Very informative and detailed slides; please look at them rather than these minutes
- Question, Sergei: are you using shallow BNNs? How many hyperparameters compared to other NNs?
  - Harrison: yes, and a comparable number of hyperparameters
- Question, Steven: you said there is "no free lunch"; what is your opinion on AutoML and similar?
  - Harrison: there may be some methods that work in some circumstances
  - There is no uniformly most powerful method
  - Most important is to figure out how to build something practical, and to put time into things that work
  - Focus on methods that have ways to actually define error bars; otherwise they are not useful

Michela: Introduction to NNs using Keras
- Keras is a modular and user-friendly library for deep learning, built on Theano and TensorFlow
- Very well documented, with lots of working examples
- Neural networks = a stack of tensor operations
- Sequential model: a linear stack of layers
  - Easy in Keras; code example in the slides
- Graph model: multi-input, multi-output, arbitrary internal connections
  - Also possible; example not in the slides
- Dense layers
  - The core unit of a multi-layer perceptron (MLP)
  - An affine transformation mapping the input to an output, u = Wx + b
  - All entries in W and b are trainable
  - Only need to specify the number of inputs and outputs for a basic dense layer (only the number of outputs after the first layer)
- Question, Vidyo: what about the output layer for multi-output?
  - Michela: can write Dense(1) for one output, Dense(N) for an N-class output problem
  - Each class has a specific label; the multi-dimensional output gives the probability of an event belonging to class 0, 1, etc.
- Activation function
  - Can be interpreted as a layer itself or attached to a dense layer (both are possible in Keras)
  - Quantifies the activation state of a node, i.e. whether it is firing or not
  - Non-linear activation functions are key to deep learning
  - Some examples of different activation functions are given in the slides
  - A sigmoid is usually used for the final layer (gives a value between 0 and 1, a probability)
  - However, it is not good for intermediate layers; use others instead
- Weight initialization
  - Before training, the weights have to begin with some value
  - Initial values should be chosen such that training can converge quickly
  - Lots of different initialization strategies are available in Keras, listed in the slides
- Forward propagation
  - Transforms the input vector of features through the layers to obtain the final output
  - Depends both on the input vector and the current weight values in each layer
- Loss function
  - Dictates how strongly we penalize different types of mistakes: the cost of inaccurately classifying an event
  - Used to evaluate the performance of the NN (can't use the area under the curve here)
  - Common loss functions are listed in the slides, but the choice is very problem dependent; you need to know your problem
- Optimizers
  - Training is just an optimization problem; the optimizer controls which method is used to converge
  - Stochastic gradient descent is the most common, but others exist and are supported
- Back propagation
  - The hard part of ML: derivatives through the whole system
  - The loss must be differentiable with respect to every parameter
  - Modern DL libraries like Keras use tensor math libraries that handle all of this in a very optimized way
  - These can compile CUDA code to run on the GPU (or machine instructions for the CPU)
- Had to skip the next couple of slides for time
  - Training a very simple NN is shown on one slide
  - A nice visualization tool is linked (playground.tensorflow.org)
- Question, Sergei: details on the visualization?
  - Michela: you can use whatever you like, not just TensorFlow
  - It shows graphically what happens as hyperparameters and similar are changed
- Question, Sergei: does Keras have cross-validation implemented?
  - Gilles: no cross-validation in Keras, but you can use scikit-learn for this
- Question, Sergei: support for Keras? It was originally one person
  - Michela: there is a new version; I haven't used it too much yet
  - New graphical interface; most examples work on either release interchangeably
  - Also, it started with one person, but now hundreds of people contribute
  - Growing community, very well documented, well supported, and you can contribute too!
  - They try to keep it as general as possible, so no super-specific layers, but you can still add them yourself on a branch
  - Also, Keras has both a Python and a C++ API (the TensorFlow C++ API)
  - So it is very useful to train in Python and then export to C++

Paul: Deep neural networks with domain adaptation
- Common problem: a classifier works great in simulation, but performance degrades significantly in real data
- Should we just make the MC better?
  - Time and work spent improving the simulation will of course help
  - However, how good does it need to be? Can you make it perfect?
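As an aside to Michela's talk: the forward propagation through Dense layers described above can be illustrated without any library. This is a minimal plain-Python sketch of u = Wx + b plus activations; the layer sizes and random weights are arbitrary illustrations, standing in for Keras's initialization strategies:

```python
import math
import random

def dense(x, W, b):
    """Affine map u = Wx + b: the core of a Dense layer (all entries of W, b trainable)."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def sigmoid(u):
    """Squash each value into (0, 1); suitable as a final-layer probability."""
    return [1.0 / (1.0 + math.exp(-ui)) for ui in u]

random.seed(0)
# Tiny MLP: 3 inputs -> 4 hidden nodes (ReLU) -> 1 output (sigmoid); sizes are made up
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
b1 = [0.0] * 4
W2 = [[random.uniform(-1, 1) for _ in range(4)]]
b2 = [0.0]

x = [0.5, -1.2, 0.3]
hidden = [max(0.0, u) for u in dense(x, W1, b1)]  # ReLU for the intermediate layer
prob = sigmoid(dense(hidden, W2, b2))[0]          # final sigmoid gives a probability
print(prob)
```

In Keras the same network would be a Sequential model of two Dense layers; training then amounts to back-propagating a loss through these same operations.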
- Outside of particle physics, there are also differences between product photos, DSLR photos, webcam photos, etc.
- Ideally, we want to train a classifier that distinguishes between classes free of noise and oblivious to the "domain" (data/MC, DSLR/webcam)
- Idea: train an additional network output to distinguish data from MC
  - If the domain classifier cannot distinguish data from MC, then the features are independent of the domain
  - Training now needs three samples: signal MC, background MC, mixed data
- Got the authors' code from GitHub for Caffe
- Used data similar to the Kaggle challenge
- Forced the network to use variables which may be problematic between data and MC
- The domain loss factor is reduced by this approach
- TMVA and the Caffe DNN without the gradient reversal layer (GRL) are at the same level
  - The signal/background separation doesn't really improve
- So far has not found a training which could really take data/MC differences into account
- The Caffe network was not using optimized variables while TMVA was, so maybe benefits are possible
  - However, that was not the point of this exercise
- Question, Sergei: training with both MC and data in the mix, adding the domain classifier and continuing training: are you pausing and then resuming?
  - Paul: yes, that's about right
  - Technically a new network is created, but all the weights are copied from the already trained network to the new one
- Question, Vidyo: missed due to room microphone problems
- Comment, Kyle: you don't have the minimax scenario common to this kind of adversarial training
  - Could be part of the reason you don't see large gains
  - Gilles: in this paper, they still look for a saddle point, not a global minimum

Dan: Jet substructure classification with deep neural networks
- Some EW particle decays to two prongs; need to distinguish this from QCD
- Many "engineered" substructure variables accounting for things like the number of subjets, mass, and energy distributions
- If we're going to use ML, do we need engineered variables?
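The gradient reversal layer (GRL) at the heart of Paul's domain-adaptation setup can be sketched in a framework-free way: the layer is the identity in the forward pass, but flips and scales the gradient coming from the domain classifier in the backward pass, so the shared feature layers learn to confuse the data-vs-MC discrimination. The class name and lam value below are illustrative; real implementations hook into the autodiff of Caffe/TensorFlow:

```python
class GradientReversal:
    """Sketch of a gradient reversal layer: identity forward, -lambda * grad backward."""

    def __init__(self, lam):
        self.lam = lam  # trade-off between the class loss and the domain loss

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad):
        # Reversed, scaled gradient flows back into the shared feature layers,
        # pushing them to *worsen* the domain (data vs MC) classifier
        return [-self.lam * g for g in grad]

grl = GradientReversal(lam=0.5)
assert grl.forward([0.3, -1.5]) == [0.3, -1.5]
print(grl.backward([1.0, -2.0]))  # [-0.5, 1.0]
```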
- Jet images: a jet is just an image in the calorimeter, 32x32 = 1024 pixels
  - Can then use standard image recognition techniques
- Tried both a BDT (scikit-learn) with engineered variables and deep learning (Keras) with jet images, comparing which was better
  - Approximately 750k free parameters in both cases
- Deep networks outperform BDTs on engineered variables by a small amount
  - Both are much better than cuts on engineered variables
- From checks, the NNs are learning the same higher-level features, especially the mass
  - The theorists have done a very good job of creating high-level variables that contain almost all the relevant information
- Would be interesting to extend to 3D, as calorimeters are 3D images
- Question, Rosa: you say the NN has learned about the same as the BDT, but have you given the NN enough to surpass the BDT?
  - Dan: all simulation based, 10 million jets
  - Couldn't see any large differences with more statistics
- Question, Sergei: what about time? How long did these take?
  - Dan: on the order of a day to train the NN on a cluster with GPUs and similar
  - The BDTs took about the same amount of time (no GPU)
  - But could probably get quite a bit of this performance out of much smaller architectures
- Question, Sergei: are these feed-forward networks?
  - Dan: simple feed-forward neural networks
  - Three or four locally-connected layers, then fully connected dense layers
- Question, Steven: have you looked at higher pT? Most of the substructure variables work great in this semi-resolved regime, but not at high pT when calorimeter granularity becomes a limitation
  - Dan: have not looked at it, but a good idea for the future
- Question, Mike: have you tried transverse energy instead of energy? Energy is not boost invariant
  - Dan: another good thing to look at

Adam: Deep learning at NOvA
- NOvA is a long-baseline neutrino oscillation experiment
- At NOvA, data is taken in 550-microsecond intervals
- Handscans are done to find neutrinos; mostly it's just cosmics/background
- Clustering in space and time creates slices
  - Helps to separate neutrinos, cosmic rays, and noise
- Traditional neural nets are great, but have deficiencies
  - They don't scale very well to raw data
  - It is important to reduce raw data down to a few powerful, well-engineered features
  - They fail to generalize to other data
- Multi-layer networks can approximate a function with fewer nodes than a single layer, so a good way to go
- Rectified linear units (ReLU) have really made all of this possible
- Convolutional neural nets are one powerful option
  - Many common kernels exist, but we want to learn optimal kernels from the data
  - Convolutional layers have a small number of weights shared across the image
- Tried several different models by different computer vision groups for image classification contests
  - LeNet, AlexNet, VGG, and GoogLeNet
  - GoogLeNet performed the best and was faster
- Converted inputs to images, only did hit clustering (no other reco)
  - Took relatively large boxes (80x100 pixels), 18.5 m deep and 4 m wide
- Created the Convolutional Visual Network (CVN), very strongly based on GoogLeNet
  - Two shorter versions of GoogLeNet running in parallel
  - Used several methods to avoid overtraining, detailed in the slides
- νe: obtained significant improvements (40%) in efficiency for the same purity
  - Primarily due to improved efficiency in resonances and deep inelastic scattering
- νμ: minimal improvement as there is not much more to learn, but still a bit better, so not underperforming
- GPU acceleration can train this in about one week; on CPU it would take much longer
- Question, Dan: the first step of processing was to project into two dimensions; why not 3D?
  - Adam: the data comes in 2D views.
  - You can construct 3D representations, but it takes some reconstruction to do that
  - Trying to think of ways to sidestep these reconstruction issues by using the 2D data
  - Detector cells are stacked either horizontally or vertically, not both
  - The raw data is inherently two dimensional
  - Planes alternate, so it can be done, but the data is intrinsically 2D
- Question, Kyle: what is the reaction in the collaboration to this? Loving/skeptical/a mix?
  - Adam: interesting from a sociological point of view
  - This is six months of work
  - At first, uniform skepticism
  - With the first results (40% improvement), even more skepticism
  - After running the network on a large number of systematically shifted samples, the gains were not illusory; the result was robust
  - Data/MC disagreements in the unblinded near detector were about the same as before
  - Now, no more detractors; full support from the collaboration
  - Original detractors: "can't argue with results"
- Question, Tim: for something that takes a week to train, how did you get started? Did you find a way to learn the mistakes you were going to make in less time?
  - Adam: yes, the week was for the final training of the production version
  - Had a good idea of heading in the right direction within half a day
  - Would set it up running, submit to nodes interactively, and watch the training and testing losses and testing accuracies
  - It is usually apparent fairly quickly when trends are not going in a good direction

Amir: Deep learning reconstruction, LArTPC to ATLAS calorimeter
- Why do we want to go deep?
  - Better algorithms: hopefully outperform hand-crafted algorithms; unsupervised learning (anomaly detection, etc.)
  - Faster algorithms: can determine precision without going through all the combinatorics; already parallelized
  - Easier development: feature learning vs feature engineering saves on development time and costs
- Many problems in the HEP software landscape
  - Complexity, future architectures, expensive computation, and development and maintenance
  - Not much support is provided for software R&D
- LBNF/DUNE will run for a long time; a major project
- Liquid Argon Time Projection Chambers (LArTPCs)
  - Fundamentally read out two dimensions, but can be projected into 3D with some care
  - Goal of 80% neutrino efficiency; the current generation achieves roughly half of this
- Most neutrino physics: need to know the flavour of the neutrino and its energy
- User-assisted reconstruction is common for LArTPC reconstruction
  - Small datasets, so automation was not needed in the past
  - Fully automatic reconstruction is now coming together
- Did a feasibility study with GoogLeNet on 224x224 pixel images
  - 90% electron efficiency for a 2% fake rate with no real effort, blindly throwing in data without optimization
  - This is better than has been seen before, so a very good start
  - Have also studied custom DNNs of various types
- Towards reconstruction with DNNs, starting with calorimeters
  - Essentially a 3D image, very similar to the LArTPC work
  - If we improve the ID and energy resolution, we can make peaks stand out
  - Can also turn this into a generative model for fast simulation, saving lots of CPU
  - Lots of potential uses in other places too, not just calorimeters
- Can consider putting all 200k cells of the calorimeter as an input to the DNN
  - Need very large samples with full simulation, not in standard datasets
- Open effort for general calorimetry studies, done outside of ATLAS (aimed at general LHC/ILC); all are welcome to join
- Question, Steven: how do you handle the different granularity of different parts of the calorimeter?
  - Amir: throw it into a network and see if it can figure it out
  - Have thought about how to use a convolutional neural network
  - Maybe a 2D CNN on each layer, so each then has the same level of granularity
  - Stack the feature maps on top of each other
  - The different layers have to be somehow coupled together to represent the same feature
- Question, Sergei: calorimeters of today and calorimeters of tomorrow may not be the same
  - It would be interesting to see some R&D study with these standard calo designs, some general wider future calorimeter
  - Amir: completely agree; a good compromise is a design already in Geant4
  - So ATLAS is a good place to start
  - The challenge is to be able to do this with the full community rather than just within an experiment
  - Need to be able to have a general discussion across experiments, not bound by one
- Question, Sergei: challenges on calorimetry and DNNs?
  - Dan: if possible, better to just have public data etc. and get people to use the data and write papers
  - Amir: completely agree; people will over-optimize for that particular challenge
  - Kyle: several people signed onto the idea of trying to make public simulators available for multiple uses
  - There was a talk at NIPS criticizing challenges, and proposing variations that have a similar feel but avoid some of the traps that have been encountered
  - Lots of the work for future data science in HEP is along the path of making public simulators
  - Sergei: would really also want people within the collaborations working on this, not just external people, as you mostly have in challenges
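Amir's suggestion above (process each calorimeter layer in 2D to bring everything to a common granularity, then stack the resulting maps) can be sketched in plain Python; the 8x8 and 4x4 layer sizes and the use of average pooling are invented for illustration, not taken from any experiment:

```python
def avg_pool(grid, factor):
    """Average-pool a square grid by an integer factor to coarsen its granularity."""
    n = len(grid) // factor
    return [[sum(grid[i * factor + di][j * factor + dj]
                 for di in range(factor) for dj in range(factor)) / factor**2
             for j in range(n)]
            for i in range(n)]

fine_layer = [[1.0] * 8 for _ in range(8)]    # finely segmented layer (8x8 cells)
coarse_layer = [[2.0] * 4 for _ in range(4)]  # coarser layer (4x4 cells)

# Bring both layers to the coarser 4x4 granularity, then stack them as
# channels of one "image" that a CNN could consume
stacked = [avg_pool(fine_layer, 2), coarse_layer]
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # 2 4 4
```

In a real network the per-layer step would be a learned 2D convolution rather than fixed pooling, so the coupling between layers is itself trainable.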