IML meeting: January 26, 2018
Peak people on Vidyo: 56
Peak people in the room: 23

Steven: Intro and news
- Markus Stoye is the new CMS coordinator
- New ALICE coordinator will be announced soon
- IML annual workshop
  - Core workshop: April 9-11
  - Full-day hackathon: April 12
  - Call for abstracts is now open, due March 12
  - Call for hackathon project proposals will open soon
- List of common questions and answers in the slides
- Next IML monthly meeting is February 28 on software+infrastructure

EP-IT data science seminar before the regular meeting
- Soumith Chintala (Facebook): Automatic Differentiation and Deep Learning
- Recording at http://cds.cern.ch/record/2302087

Andrea Valassi: ROC curves, AUCs, and alternatives in HEP event selection and in other domains
- Got interested from the LHCb challenge, where the winner maximises the area under a ROC curve
- Now trying to understand what the AUC really is, and why it is used in other domains
- Binary classifiers: true positive TP, false positive FP, true negative TN, false negative FN
  - Different domains have different focus and terminology for these four cases
- Prevalence = S/(S+B)
- ROC and PRC (precision-recall) curves
  - ROC doesn't depend on prevalence, while the PRC changes dramatically
- Domain-specific challenges
  1. Qualitative imbalance
  2. Quantitative imbalance
  3. Prevalence known? Time invariance?
  4. Dimensionality? Scale invariance?
  5. Ranking? Binning?
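The prevalence point above can be illustrated with a small sketch (pure NumPy; the Gaussian score distributions and sample sizes are illustrative assumptions, not from the talk). The rank-based AUC estimate is essentially unchanged when the background sample grows, while precision at a fixed cut collapses:

```python
import numpy as np

rng = np.random.default_rng(0)

def roc_auc(pos, neg):
    # Rank-based (Mann-Whitney) AUC estimate: the probability that a
    # randomly chosen positive outscores a randomly chosen negative
    ranks = np.concatenate([pos, neg]).argsort().argsort() + 1.0
    u = ranks[: len(pos)].sum() - len(pos) * (len(pos) + 1) / 2.0
    return u / (len(pos) * len(neg))

def precision(pos, neg, thr):
    # Precision at a fixed score threshold: TP / (TP + FP)
    tp = (pos > thr).sum()
    fp = (neg > thr).sum()
    return tp / (tp + fp)

# Same underlying score distributions, two very different prevalences
pos = rng.normal(1.0, 1.0, 2000)        # "signal" scores
neg_50 = rng.normal(0.0, 1.0, 2000)     # prevalence ~50%
neg_1 = rng.normal(0.0, 1.0, 200_000)   # prevalence ~1%

print(roc_auc(pos, neg_50), roc_auc(pos, neg_1))                # nearly identical
print(precision(pos, neg_50, 1.0), precision(pos, neg_1, 1.0))  # very different
```

In the talk's HEP language, precision corresponds to purity, and the fraction of signal passing the cut is the signal efficiency.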
- Medical diagnostics: optimise diagnostic accuracy
  - Different people may want to optimise different things
  - Most popular metric = (TP+TN)/(TP+TN+FP+FN)
  - Catch: for a super rare disease (1 in a million), the accuracy barely changes whether or not the test ever detects it
  - For this reason, the field moved to the ROC curve
  - AUC interpretation: the probability that the test result of a random sick subject indicates greater suspicion than that of a random healthy subject
  - Found a further limitation: ROC isn't great for highly imbalanced datasets, so some moved to the PRC
  - Active area of research; ROC AUC is not always the best choice for medical diagnostics
- Information retrieval: distinction between relevant and non-relevant documents
  - Metrics evaluate classifiers based on the PRC
- HEP example
  - To minimise the statistical error on a cross-section measurement, maximise efficiency times purity
  - Eff*purity is qualitatively relevant, numerically nice, and is correct for cross-sections
  - However, it's not perfect for all situations; different cases want different metrics
  - In this type of case, the AUC is irrelevant
- Binary classifiers in HEP
  - Event reconstruction --> (software) trigger --> physics analysis
  - TN are relevant for reco, but not for trigger and physics analysis
  - TN enters the definition of ROC and AUC, so these are not good/relevant for trigger+analysis
  - Physics is the only one of the three fields that uses binning (local efficiencies instead of global)
- Huge amount of additional detail in the slides on different metrics
- Summary
  - Different disciplines/problems have different challenges, motivating different metrics
  - The most relevant metrics in HEP event selection are purity and signal efficiency
  - AUCs may not be the optimal choice (often not)
  - For every problem, we should identify the optimal metric
- Comment (Steven): At least in my cases, we may use ROC, but not typically AUC
- Andrea: for particle ID or event selection?
- Steven: I personally mostly do particle ID
- Andrea: in that case, ROC without AUC may make sense
- Paul: to be truly optimal we need to know lots of the things on the right of slide 17
  - For practical purposes, the great advantage of the AUC is that it is computable fast
- Andrea: started with the Kaggle challenge and got confused
  - If you train people that the AUC is right, they will use it without thinking
  - It's a mind change; people will otherwise do things even when they don't make sense
- Question (Vitaliano Ciulli): slide 20, is this a statistical idea, or a really viable approach?
- Andrea: A really viable approach
  - You have a matrix element that depends on a parameter, and you can take the derivative
  - Can try to train an ML variable to reproduce that, to see how it is spread in space
  - It can be done; when you have many dimensions of MC you try different variables
  - Try to map this to the space distribution and do some calculations
  - More at the level of crazy ideas
- Question (Joosep Pata): true negatives are called irrelevant, but if we use the same MC samples, then the true negatives are well defined
  - The background is always the same
- Andrea: Of course, when using a sample, the number is well defined
  - Take the trigger: if your numbers are what comes out of L0, the important thing is the L1 rate
  - The L1 rate must be the same irrespective of the L0 rate
  - When you do your physics publications, in the end only the TP, FP, FN count
- Joosep: when using data, yes.
  When using MC for a high-level Higgs analysis, it's not clear
- Andrea: Interesting to try to think about which variables you really need
  - In some cases, not exactly the same
  - Sometimes efficiency*purity, sometimes the absolute scale, sometimes the absolute number of background events selected (trigger rate)
  - The dimensionality of the problem is usually 2 or 3
  - It's the true negative that is almost always the missing one of the four

Daniel Krefl: Riemann-Theta Boltzmann machine
- arXiv:1712.07581
- Boltzmann machine (BM)
  - Two-part system: hidden and visible sectors that are arbitrarily connected
  - Often binary-valued states
  - In computer science, called energy-based models, as this is a statistical mechanics system
  - The probability of the system being in a specific state is given by the Boltzmann distribution
  - Practically not feasible; for applications, only restricted BMs (RBMs) have been considered
    - Removes the self-couplings
  - If the self-couplings could be included, the machines would be more powerful
    - Could model non-trivial covariances of the system
- Change the domain of the state values
  - One set continuous, the other quantized
  - With some algebra, this can be calculated using a Riemann-Theta function
  - Still an infinite sum, but you can mathematically prove that for a given precision you only need to sum a finite number of terms
  - You can thus evaluate the partition function efficiently
  - Gradients can also be calculated analytically
- Use the theta function as a neural network activation --> theta neural network
  - Each node learns its own activation function to model the system
  - This means you can learn using much smaller networks, as part of the information goes into the activation function itself
  - However, evaluation of the Riemann-Theta function is expensive (but practically possible)
- Wrote a new framework, as they were really changing the basic building blocks
  - riemann.ai/theta
  - Very easy interface, inspired by Keras
  - SGD and a genetic optimizer out of the box
  - Easy to extend
    functionality (object oriented)
  - Currently CPU based, but working on GPU and FPGA support; a better math back-end is in progress
  - The expected speedup will bring large-scale applications into reach
- Quick announcement of a workshop from April 30 to May 4 in Hainan, China
  - Local costs covered, just need to buy the plane ticket
- Question (Steven): you said that the evaluation was more expensive, roughly how much?
- Daniel: roughly a factor of 10, but we think we can recover most of this with the coming developments
- Question (Michela Paganini): do you believe these equations could be implemented in PyTorch instead of building a new framework?
- Daniel: didn't look into PyTorch, as it wasn't as popular at the time
  - I would need to look into how deep you can go into the system
  - Here speed is critical; need to have the activation functions running in C
  - If the framework is there to define custom activation functions in a speedy way, one could do this stuff in that framework
  - Our motivation was that we wanted something where we had full control for R&D for the moment
  - If we decide to make this bigger, we may want to move to something else
  - If we move this to another group, distribution is a problem; we'd need to be big enough that it is distributed with the main branch
- Question (Hossein Afsharnia): is this influenced by the distribution that we have over the input data?
- Daniel: have to distinguish between two things, BMs and the TNNs derived from BMs
  - BMs are a device to learn the underlying input density of the data
  - The TNN in turn is a new neural network layer with a new activation function
  - The visible sector is continuous; the hidden sector encodes the state in a quantized space
  - So the input itself is continuous
- Question (Anton Poluektov): works as a good feature detector, but how does the training work?
- Daniel: train in two steps
  - Learn the probability density of each part of the picture in the first step
  - Then generate the density for the feature vector in the second step

Savannah Thais: NIPS 2017 summary (HEP perspective)
- Largest ML conference; this year ~8000 participants (up from ~5000 last year)
- 800 accepted papers, 53 workshops, 9 tutorials
- Deep learning for physical sciences workshop
  - 30 accepted papers, 5 invited talks, 6 contributed talks
- Deep topology classifiers for a more efficient trigger at the LHC
  - Sequence of PFlow candidates fed into an RNN, then processed with an LSTM or FRU
  - Images processed with a CNN
  - Initially to select ttbar events (currently dominated by W+jet and QCD with the single-lepton trigger)
- Electromagnetic shower classification using a DenseNet
  - Outperforms other feature- and cell-based classifications
- Particle classification, energy regression, and simulation
  - DNN for classification using flattened cell information
  - Energy reconstruction using a CNN
  - Basic GAN for an electromagnetic calorimeter, generating 3D energy arrays
- Lots of jet contributions
  - Tips and tricks for training GANs with physics constraints
    - Nice summary of common issues when using GANs
  - DeepJet: generic physics-object-based multiclass classification
    - CNN jet classifier using particle candidate features
  - Neural message passing for jet physics
    - Graph embedding of jets, outperforms the previously studied RNN embedding
  - Adversarial learning to eliminate systematic errors
    - Use adversarial learning to reduce systematics
  - Data augmentation, pivot adversarial network, and tangent propagation
    - DA and pivot outperform the baseline, tangent does not
- Particle track reconstruction with deep learning
  - Image-based approach to track reconstruction
  - RNN with individual layers of the detector, and CNN with a 3D image of the full detector
  - Also looked at point-based ML with an RNN to predict the spacepoint in the next layer
- A few simulation and modelling contributions
  - Improvements to
    inference compilation for probabilistic programming
    - Interfaces with existing scientific simulations
    - Example of an interface with SHERPA
  - Graph memory networks for molecular activity prediction
    - Interesting RNN structure to model molecular behaviour
    - Standard RNN connected to a matrix RNN (external memory)
  - Nanophotonic particle simulation and inverse design
    - Use a NN to produce a range of measurements
    - Can also run the network backwards to design materials for a desired spectrum
- Physics-influenced ML: can incorporate QFT information to constrain the models the algorithm will learn
  - How can physics inform deep learning methods
  - Towards a hybrid approach to physical process modelling
- Good attendance from both scientists and ML experts
  - HEP very well represented, half of the organizers from CERN
- A lot of our LHC contributions use toy datasets
  - Good to see what is feasible, but ultimately we want to actually use the techniques in our experiments
- A lot of the work right now is classification problems, some with simulation
- Lots of other interesting work being done in related fields that we can learn from
  - Predictive neural networks
    - RNN where the basic block is a 3-mode tensor that computes a combination of two input vectors
    - Show promise: much better accuracy in a shorter amount of time than standard RNNs
- Interesting symposium on explainable machine learning
  - Hosted a debate, and also a challenge
  - This is particularly important for HEP
- Lots of other interesting symposia and workshops occurred; a good selection is listed in the slides
- Question (Hossein Afsharnia): is this a kind of reinforcement learning, slide 14?
- Savannah: Trying to combine insight from standard NNs with reinforcement representations, a hybrid of the two

Lars Varming Joergensen: Spes Spirae
- Trying to measure the motor symptoms of Parkinson's disease
- Second most common neurodegenerative disease (after Alzheimer's)
- Mostly affects older people, about 1% of people over 65
  - Some get it at a younger age; I was diagnosed 1.5 years ago
- After being diagnosed, was surfing the internet and trying to find ways to measure the symptoms of Parkinson's
- Motor symptoms, nothing unseen
- No cure; the medicine is for the symptoms, and it's up to the patient what kind of medicine to take
- Want to understand these symptoms better
  - Tremors
  - Rigidity of movement
  - Slowness of movement
  - Postural instability
  - Sometimes sleeping problems and constipation, which are harder to measure
- Cause of the disease: lack of dopamine in the parts of the brain responsible for movement --> cells die
  - By the time you are diagnosed, usually 70-80% of the dopamine-producing neurons are already gone
  - Can't give replacement dopamine, as it can't pass through the blood-brain barrier
- When starting discussions with the head of the neurology department, the question was "why do you think you can do better now"
  - Technology has changed; it's much easier and more comfortable to wear things 24 hours a day
- Hardware
  - Much better battery lifetime makes it possible to measure essentially 24/7
  - Very sensitive sensor package, as sensors are much better now
  - Simple design, preferably aesthetically pleasing
  - E-ink screen and buzzers to alert the patient when to take pills or exercise (or when the battery is low)
  - Two buttons (green and red) make it possible to delay pill-taking or exercises depending on whether the moment is good or not
- Requirements
  - Should last 10-14 days between charging
  - All electronics must have low power consumption (no wifi/bluetooth)
  - Data downloaded from the watch during charging
    - Slow download because of the low-power memory
    - Download will probably take about 2 hours (5GB), but recharging will take a few
      hours too, so it's OK
- Now that we have the data, need to ensure the patient benefits
  - See their own data: are the symptoms getting worse?
  - Need two databases: one anonymised, and one for the neurologist or patient
  - You can also use this to see if a new drug is actually working
    - One of the main reasons to buy this in the first place
  - Hopefully, since the patient gets a better treatment, the health insurance company may pay for the device
- Question of money
  - Retail probably 50-100 euros
  - This is really cheap: when I started treatment, a month of medicine was 98 euros, while in the USA the same thing was $360
  - In Europe, the price has since gone down to 65 euros; in the US it has gone to $640/month
  - Patients in the USA are therefore not taking drugs, as they can't afford them
  - If a device that costs 50-100 euros can help get you the right drugs, it's not expensive
  - The health insurance company might be interested
- The big payoff for us is the anonymised database
  - Ask about medical history (medical surgeries, history of Parkinson's in the family, etc)
  - Hopefully we could then unleash ML on this to try to extract new relationships
  - Could identify better progression markers for Parkinson's, maybe even help to find a cure
  - 5-10 million patients in the world; what if 100k patients were wearing our watch?
- Next steps
  - Want to build the first 20-40 prototypes
  - Check how well all of the software works
  - Will likely take about a year, then go into mass production
  - Might be able to use CERN IT infrastructure for the trial period
- Question (Hossein Afsharnia): about working on the data that you have on this device
  - Is there any group working on this now?
- Lars: Not that I am aware of
- Hossein: I only hear about this data now; is it open source, can we work on it?
  - Can we use it as a kind of pattern, which could be useful for future work, on your advice?
- Lars: The idea of the database is that it should be open source
  - Don't want to make it a priority that one group is the only one who can touch the data
  - The more we can extract from the data, the better for the patients
- Question (Anton Poluektov): one of the obvious solutions is to release an app for the Apple Watch
- Lars: the problem with that is the app takes some power, as it runs continuously
  - It would drain the watch much too quickly
  - We have looked at this, at different watches, to see if it was possible
  - If we really want to do this, it has to be a dedicated device
  - Pebble: originally a charge every 10 days; after the app, it was every 2 days
- Anton: really taking data at 26 Hz continuously, so then 5GB raw?
- Lars: yes, exactly
- Paul: a wearable that tracks how you move at night sounds like a Fitbit
- Lars: that's a completely closed-off environment; they don't let anything other than their software on it
- Paul: yes, I mean you are on the right track, such a product can be accepted by the customer
- Lars: someone told me last week that we should talk to Swatch, as they are used to making things look nice
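The 26 Hz / 5 GB / 10-14 day figures from the exchange above can be sanity-checked with a bit of arithmetic; the sensor channel layout was not given in the talk, so the check only derives the bytes-per-sample that those figures imply:

```python
# Back-of-the-envelope check of the quoted numbers: continuous 26 Hz
# sampling over a 14-day charging interval, against ~5 GB of raw data.
SAMPLE_RATE_HZ = 26
DAYS = 14
RAW_BYTES = 5e9  # ~5 GB quoted in the discussion

samples = SAMPLE_RATE_HZ * 60 * 60 * 24 * DAYS
bytes_per_sample = RAW_BYTES / samples
print(f"{samples:,} samples in {DAYS} days")
print(f"~{bytes_per_sample:.0f} bytes per 26 Hz sample implied by 5 GB")
```

About 31 million samples, so roughly 160 bytes per sample: consistent with a fairly rich multi-channel sensor record rather than a single accelerometer axis.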