Approximate peak attendance: 23 in the room, 30 on Vidyo

David Rousseau - HiggsML
- Once you release a challenge, it's out of your hands. People will solve the challenge, not necessarily your intended problem
- XGBoost was created for the challenge
- Invariant mass and similar variables are not necessarily findable with ML; expert input is beneficial
- Cross-validation is very important (talk on this later in the meeting)
- Optimizing the area under the ROC curve is not ideal but is what is usually done (in most cases one wants to focus on certain regions, not the full integral)
- Lack of statistics is the main reason why deep learning was not as beneficial as expected (~600k events; DL would benefit from 10x more)
- A BDT is a good algorithm to start with; new software for BDTs is out there
- Many techniques beyond training demonstrated their importance (cross-validation in particular)
- Question: what about dark knowledge?
  - David: a very sophisticated neural network, trained in a complex way
  - Can train a simple net to mimic the complex one
  - For unclear reasons, this works better than directly training the simple one
- Cecile: for the data as they are, it is very likely that deep neural nets will not add much with respect to BDTs
  - With much more raw data, this might change
  - With derived features built from physics expertise, deep neural nets don't help too much
  - Would be useful to collectively write some reflection on this
  - Gabor's remark was that he won this challenge not because he used deep neural nets, but because he is an expert in cross-validation
- Question: systematics are not really considered? The model does not capture the data, so estimating the model bias can estimate your systematic
  - David: in the end you want to be able to quote a number for your signal efficiency with a correctly evaluated systematic
  - Cecile: systematics are more related to your MC not accurately representing your data
  - Comment in room: systematics are handled analysis by analysis; there is no single prescription, as it depends on topology, where in the detector, etc.
- When using raw Cherenkov detector inputs, neural networks are much better; when using derived features, BDTs perform better

Andrey: LHCb Flavors of Physics challenge
- Goal: understand if the efficiency of searches for new physics can be improved using new ML ideas, and evaluate classifier verification checks
- Additional restrictions were added to reflect a physics analysis
- Special prize for physics, awarded next week at NIPS
- Challenge ran on Kaggle
- First competition using additional checks before results reach the leaderboard
- Scripts allow for understanding where time is being spent
  - Seems that more time was spent trying to win by beating the system (breaking the checks) rather than solving the physics using novel methods
- Created a metric using different weights for different bins/regions
- A check was added to avoid mass correlation; we want the peaking to come from the signal, not the background
- Another check was added to ensure that the model behaves similarly in MC and data: it uses the D->phi+pi proxy channel, which has both MC and real data events, and runs a KS test (a minimal sketch of such a check is given below)
- Problem: the control and signal channels behave quite differently
  - You can train a classifier that differentiates between the two channels
  - This creates a means of exploiting the checks: learn to distinguish between signal and control, build a classifier on the training data, and exploit the simulation freely
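As an illustration of the MC/data agreement check mentioned above, here is a minimal sketch (hypothetical classifier object, arrays, and threshold; not the organizers' actual implementation) comparing classifier outputs on control-channel simulation and control-channel real data with a two-sample KS test:

```python
# Minimal sketch of an MC/data agreement check on a control channel.
# Assumptions: `clf` is any trained classifier with a predict_proba method,
# `control_mc` and `control_data` are feature arrays for the D->phi+pi proxy
# channel in simulation and in real data; the threshold value is invented.
from scipy.stats import ks_2samp

def agreement_check(clf, control_mc, control_data, max_ks=0.1):
    """Return True if the classifier responds similarly to MC and real data."""
    scores_mc = clf.predict_proba(control_mc)[:, 1]
    scores_data = clf.predict_proba(control_data)[:, 1]
    ks_stat, _ = ks_2samp(scores_mc, scores_data)
    print(f"KS statistic on control channel: {ks_stat:.3f}")
    return ks_stat < max_ks
```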
- What can you do about this?
  - Ignore it and go on
  - Stop the challenge, try to fix the metric and data, restart the challenge
  - Inform participants about possible consequences (this option was followed)
- Another problem: people figured out how to calculate the parent particle mass
- Lots of meaningful ideas were tried during the challenge, but not as much effort as hoped, as people were mostly trying to get the top score (by beating the system)
- Some tricks were found to cheat the checks that were introduced; good to be aware of for future challenges
- It's hard to make a completely hack-proof metric
- Rule of thumb: if nothing works, use XGBoost (it was used within the top 3 solutions)
- Solutions will be presented at the Heavy Flavor Data Mining workshop in Zurich in February
- Good checklist for organizers of future challenges on the last slide

Josh: future challenge ideas
- HEP analyses typically have a mix of high-level and low-level features
  - The exact division between the regimes is a bit vague
- MEM discriminators are built from the underlying matrix elements of the signal and backgrounds
  - A relatively complex high-level feature
  - Neural network + MEM combined can really improve the performance, shown by ATLAS at DS@LHC2015
- Possible challenge for signal/background discrimination with two/three variations:
  - MEM + low-level features from simulation
  - Low-level features only
  - Low-level features plus a faster-to-evaluate ML-based proxy for the MEM (also discussed at DS@LHC2015)
- The challenge can be based on an existing ATLAS/CMS analysis or a variation thereof
  - However, need to ensure that the MC provided has sufficient statistics
  - For this challenge, fast simulation is probably sufficient; need to understand the tradeoff of a huge MC set vs time spent working with ML
- Question: what is the primary aim of the challenge?
  - Josh: what is the amount of MC production you can replace with ML
- Comment: also good to understand how much information you destroy with this procedure, as you are focusing on a specific ME
- Comment: the MEM is primarily aimed at two very specific physics processes, such as SM Higgs analyses; it doesn't directly translate into a more generic search

Andreas: tracking challenge
- Asking for input on the challenge being planned; contact them with ideas
- LEP: finding tracks --> fitting tracks --> classifying tracks
- Global pattern recognition and conformal mapping have helped to make the patterns easier to find (a toy illustration is given below)
  - This is unfortunately not as efficient as needed (and we don't live in an ideal world), so it is not used in ATLAS and CMS
  - It would be good enough for 90%, but not for the last 10%, where detector inhomogeneity etc. prevents its use
- Instead: build seeds from hits and run a progressive Kalman filter, making sure subsequent hits are compatible
  - The combinatorics explode and become hard to handle
  - Enemies: random combinatorics creating fake tracks
- Want to tackle the first step, finding the tracks, with this challenge
- Note that right now the LHC experiments are not limited by our ability to do pattern recognition when identifying tracks, but the CPU for doing it is a massive burden
  - We are doing a great job, it just takes lots of time, and this will become untenable in the future
- HEP pattern recognition techniques are 25+ years old, developed at the end of LEP
  - Are there better techniques out there? The challenge asks this question.
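For context on the conformal-mapping remark above, a toy sketch (invented geometry and numbers, not challenge code): hits on a circle through the origin, as produced by a track from the beamline in a uniform field, become collinear under u = x/(x^2+y^2), v = y/(x^2+y^2), so circle finding turns into line finding:

```python
# Toy illustration of conformal mapping for track finding (invented numbers,
# not the challenge setup). A circle through the origin with centre (a, b)
# satisfies r = 2*a*cos(phi) + 2*b*sin(phi); after the conformal transform
# its hits satisfy 2*a*u + 2*b*v = 1, i.e. they lie on a straight line.
import numpy as np

a, b = 3.0, 2.0                               # circle centre (track curvature)
phi = np.linspace(0.1, 1.2, 20)               # sample hit angles along the arc
r = 2.0 * (a * np.cos(phi) + b * np.sin(phi))
x, y = r * np.cos(phi), r * np.sin(phi)       # hit positions on the circle

u = x / (x**2 + y**2)                         # conformal transform
v = y / (x**2 + y**2)

slope, intercept = np.polyfit(u, v, 1)        # fit a straight line in (u, v)
print(f"fitted:   v = {slope:.3f} * u + {intercept:.3f}")
print(f"expected: v = {-a / b:.3f} * u + {1.0 / (2.0 * b):.3f}")
```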
- At the stage of determining where to place the challenge on the spectrum between ideal and reality
  - If we go too far toward the ideal, conformal mapping is the ideal solution, but this doesn't work in practice
- A few requirements that must be considered:
  - Detector geometry
  - Simulation
  - Event data format (probably JSON)
  - Visualization tools
  - Well-defined goal
  - Different categories of questions/requests
- A rough detector design is in place, with the ATLAS magnetic field structure to have inhomogeneities, but the actual detector layout is neither ATLAS nor CMS
- A simple web-based visualization display is now available
- Technical track reconstruction efficiency is close to 100% even at mu=200
- Fake rates at O(1%) in Run-1; aiming for the same at mu=200
- Need to very carefully define what is a good track and what is not (for scoring the outcome)
  - Lots of work ongoing to determine a means of scoring tracks (a purely illustrative sketch of one possible convention is given below)
- Categories have not yet been defined
- Training dataset is straightforward: simulate a training sample with the generic but realistic detector
- Test dataset: only simulated hits; the best solution is the one that optimizes the scoring function
- Hope to finalize all but the categories and scoring/goals this month or in January
  - Scoring function by Q1 2016 (discussion at Connecting the Dots)
  - Finalize categories and the full set by ~April 2016
  - Hope to start the challenge by summer 2016
- Question: balance between performance and computing?
  - Andreas: if we can just call our existing identification software in the right regions, that's already a huge help
- Question: how do you sell this challenge to people who don't know how tracking works?
  - Andreas: yes, we need a selling point
  - If we simulate black hole production with lots of tracks, it gives an analysis flavor and a fancy name to draw people in
- Question: ATLAS and CMS are involved, what about LHCb and ALICE?
  - Open discussion, associated with the "Connecting the Dots" workshop
  - Pattern recognition is so bound to the geometry that the question is entirely different
  - Need to fix it to a particular representative geometry; the full phase space tracking geometry model was chosen
- Question: we already reconstruct very well, so should the metric focus on CPU?
  - The problem with pattern recognition: with the right classifier but infinite time, you have everything solved
  - We don't think we're looking for more efficient pattern recognition (that is dominated by the detector)
  - Looking at whether our concepts of pattern recognition can be done in a feature-extraction way
- Question: what is the goal of the challenge itself? The information available in your hits is not the same as a function of pileup conditions; one can imagine that at low pileup the density of hits is small
  - A determination on some data with low pileup allows one to probe the magnetic field inhomogeneities, which would remain at high pileup
  - Andreas: in the end we can provide training samples from single particle up to mu=200, but define the challenge for mu=200
  - The focus is pattern recognition at high pileup
- Question: we don't have a precise in situ measurement of the magnetic field, but your determination of this limits your potential at high mu too
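On the track-scoring point above: the metric was still being defined at the time, but one common convention, shown here purely as an illustration (invented thresholds, not the challenge definition), is to match each reconstructed track to the true particle contributing most of its hits and require both high purity and high hit-collection efficiency:

```python
# Illustrative track-matching score (NOT the challenge metric, which was still
# under discussion). A reconstructed track is called "good" if a majority of
# its hits come from one true particle and it collects most of that particle's
# hits; the 50% thresholds are invented for illustration.
from collections import Counter

def match_track(reco_hit_ids, truth):
    """reco_hit_ids: hit ids on one reconstructed track.
    truth: dict mapping hit id -> true particle id."""
    counts = Counter(truth[h] for h in reco_hit_ids)
    best_particle, shared = counts.most_common(1)[0]
    purity = shared / len(reco_hit_ids)
    n_true = sum(1 for pid in truth.values() if pid == best_particle)
    efficiency = shared / n_true
    return best_particle, purity, efficiency

def is_good_track(reco_hit_ids, truth, min_purity=0.5, min_efficiency=0.5):
    _, purity, efficiency = match_track(reco_hit_ids, truth)
    return purity > min_purity and efficiency > min_efficiency
```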
- Question: do you have a way to measure the amount of CPU time people are taking?
  - Not really, at least not with Kaggle
  - Might be worth looking into other platforms
- Question: this might be an example of where simplifying the problem is not the right solution; rather, stay complicated but define steps along the way in a series of challenges/stages where you build up to the final task
  - Agree and open to this idea, not yet sure what path will be taken

Sergei: future ideas
- We've done lots of classification challenges, but what about regression challenges?
  - Instead of defining a metric for maximum separation gain, aim at minimal variance
  - Can think of it as a clustering problem: minimal intra-cluster distance, maximal inter-cluster distance
  - Already in use for photon/electron/b-jet measurements; can we do better? A regression-focused challenge could answer this
- Another idea: multi-target regression (fast simulation, etc.)
  - Example: 14 input variables and 14 outputs to estimate; want a model that simultaneously describes all the outputs (a generic sketch is given below)
- Past challenges have really focused on classification, and we should avoid repetition
  - Potential future challenges can focus on regression and both inspire new ideas, provide a different goal, and link more closely to what we do in physics
- Question: what type of tool did you use to make the models?
  - A package named CLUS (?)
  - Will talk more about this in the next IML meeting
- Question: we train on MC most of the time, which is not perfect, and cover the differences with much larger systematics; maybe we could improve the systematics themselves
  - Yes, we are often overstating our errors
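For the multi-target regression idea above, a generic sketch with scikit-learn on random toy data (CLUS itself is a separate, tree-based tool; only the 14-in/14-out shape is echoed from the example):

```python
# Minimal multi-target regression sketch (generic scikit-learn, not CLUS).
# One model predicts all 14 outputs simultaneously, as in the fast-simulation
# idea above; the data here is random and purely illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 14))                                         # 14 inputs
Y = X @ rng.normal(size=(14, 14)) + 0.1 * rng.normal(size=(5000, 14))   # 14 targets

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3, random_state=0)

# RandomForestRegressor handles multi-output targets natively, so all 14
# outputs are modelled by a single forest rather than 14 separate regressions.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_tr, Y_tr)
print("average per-output R^2:", round(model.score(X_te, Y_te), 3))
```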
Tim: Problem statements
- The type of people who take part in competitions is often not the academic ML people
- Maybe we should make something more like a problem statement, where we are not sure that there is a solution
- If you want to engage the more academically minded people, you need to release very large amounts of data which is very easy to access (you don't have to be part of the collaboration, fill out lots of paperwork, etc.)
- Difference between a challenge and a problem statement: a long-term undertaking, ~1 year, organizing workshops along the way where people can meet and discuss
- To attract academics, can have some type of proceedings or similar which can be published
- If you are interested in this type of effort, contact Gilles and Tim
- Lots of problems to circumvent (how to release tons of data?)
- Need to have sufficiently ambitious problems, not something simple
- Question: skeptical of the challenges we are currently planning
  - We simply don't have a large ML base at CERN; doing lots of challenges avoids this problem rather than solving it
  - Then we wait a year to set up a challenge
  - Wait a year for the results of the challenge
  - Take a year to implement the results and benefit
  - We do need more academics involved, and this is a step in the right direction
  - Tim: agrees
  - Sergei: partly, this working group is intended to work in this direction
- Question: we can have both; the people who contribute to this and to challenges are likely different
  - A good way to have both parts of the community involved
  - How would you get to those academics in completely different fields and get them involved?
  - Tim: present or give a poster at NIPS or ICML or similar; rely on having an interesting enough question to get academic people interested, and rely on word of mouth to spread it
- Question: progressive challenges/events, two workshops (NIPS in December, ICML in ~July)
  - If we manage to be organized and coordinated in the challenges, and submit proposals together on which challenges of the year to propose, people can progressively look forward to and contribute to these challenges
  - There are people who do PhDs in ML, and this type of process would be very beneficial to them
- Isabelle: problems can be hard; the problem is when they are not well posed or are difficult to understand
  - Need to make sure the way the problem is posed is simple, but the problem itself can be very hard
- Cecile: agrees completely; thinks the tracking challenge is a great idea for forming an entry point into more long-term research
  - The hard parts are probably not only in the science, but in making the problem understandable for the ML community
  - This needs people dedicated to it
  - It will become easier and easier to understand each other as time goes on, but a periodic workshop and an associated brand are the best way

IDIAP: BEAT platform
- An open-source web-based platform enabling reproducible software-based experiments
  - Certify results and prepare publication-ready assets
  - Provide access to data but also ensure privacy and confidentiality
- Workflow:
  - Databases and specific usage protocols, done by administrators
  - Create workflows and algorithms to address the particular problem
  - The platform runs user toolchains and algorithms respecting the usage protocols defined for your database
  - Results are stored and indexed, so you can create leaderboards and track advancements
- Databases can be composed of any type of raw data and usage protocol(s)
  - A protocol defines how the data can be used (train, validate, test)
  - Data is hidden within the platform and can be public or confidential
  - The only limitation is that the data must be Python-readable
- Toolchain: a sequence of boxes representing how the data flows from the databases to the final analysis
  - Toolchains are versionable and trackable, can be branched, and can start from previous work or the work of others
- Algorithms can be written in any programming language, but currently only a Python backend exists (can be expanded if needed, just not currently in place)
- Experiments bring all of this together; only components which fit together can be combined
  - The toolchain is hybrid by design: it can have Python in one step and MATLAB in another automatically
  - Configurable parameters (email notification on completion, etc.)
  - Easily reproducible, fully tracked, no hidden parameters
- What unique features does BEAT provide?
  - Data is hidden within the platform and doesn't need to be made public; includes automatic caching (don't rerun things that aren't needed)
  - Unique result certification
  - Reproducible by design
  - Privacy by design (unless the user specifically requests to share)
  - Open source, so you can create your own platform
  - Leaderboards and built-in tutorials are on the way
- Please go ahead and try it; links on slide 10 (signup and forums)
- Gave a quick demonstration of the platform showing features and navigation
- Question: does this keep a history of everything you've tried?
  - Yes, your account keeps track of every experiment you have tried
  - It also keeps track of everything shared with you
- Question: what is the limitation on size?
  - The hard drive size
  - You can install the platform on your own servers, or use the existing servers
  - Can rank everything by computing time here, as you control the execution environment
  - When you run an experiment shared with you, you can fork it and run over a new dataset of your own

Isabelle: ChaLearn
- ChaLearn is a non-profit organization that has been organizing challenges since 2005
  - Very much academically oriented; organized not to make money, but to advance science
  - Usually 3-4 challenges per year on ML in various domains, including HEP
  - Selects the platform best suited to the problem of interest
- If you want to run things at scale with huge data, privacy issues, etc., it stops making sense to submit results; instead participants should submit code
  - This has been done for a number of challenges now
- The CodaLab platform is similar in spirit to BEAT
  - Powered by the Microsoft cloud, meaning it is sponsored by Microsoft; no need to pay for the compute time so far
- AutoML: fully automated machine learning
  - Everyone has failed run 3 just recently
  - Question: what does "failed" mean?
    - Mostly it's failing due to memory management
    - We have limited computing and memory resources, so it is similar to the tracking challenge
    - We have decided to increase the size and time limits now to make progress, but hopefully this will change the approach toward staged-complexity algorithms
- Challenge design: define, implement, attract, reward, harvest
- Idea: set a requirement to open-source the code if you want the final prize, but let people join without that
  - Allows companies with IP concerns to set the bar high, while encouraging others by showing that something is possible
- Types of submissions: results, code, or maybe virtual machine images
- Beta testing is critical: data leakage (putting the answer into the problem, such as the problem encountered in the LHCb challenge)
  - A release a day or two earlier to past participants/experts to let them find such problems is one possibility
- Attracting people is critical; have one-day hackathons to teach people how to get into the problem
- Making it a game is a great way to hook academics etc. and get them to keep submitting
  - Intermediate rewards
  - Constant things going on to keep up interest
  - Rewards with multiple smaller prizes are very important, more attractive to most people
- Follow-up after the challenges!
  - This is one of the biggest problems with most Kaggle challenges
  - Workshops with a broader scope than the challenge are good to mix people and enrich everyone's experience
  - Crowd-sourced papers and proceedings are a good way to get more people involved
  - Analyze the pros and cons of the winners, and next steps that weren't the core of the challenge
- Question: success stories of step 5 (harvest), as we have struggled with this?
  - Grant proposals now include having one person working on this full time for a while
  - It can take 6 months to a year to run a new set of systematic experiments
  - Plan to have a post-doc or similar working on this
- Comment: making challenges like a game is a very good strategy to increase involvement!

Tom: Cross-validation
- Want confidence that what you train is accurate/valid on unseen samples
  - One of the main aspects of the winning entry to the HiggsML challenge was the winner's extensive use of cross-validation
  - Needed for model selection and performance estimation
- Aside: thinking of support vector machines in TMVA
  - Only the points closest to the decision boundary are used to define the separating hypersurface
  - Points can go on the wrong side of the plane, but this can become quite computationally expensive
  - Tried multiple kernels (added kernels beyond standard TMVA)
- In reality, while we have lots of data, our datasets in HEP are smaller than we would like
- Splitting data into training and test samples: the hold-out technique
  - May not be able to reserve a large portion of data for testing, so hold-out may not be viable
- Instead, k-fold cross-validation (a minimal sketch is given at the end of these minutes)
  - Folds are randomly sampled but independent, so no repetition; repeat k times
  - Uses the whole dataset for both testing and training
- How many folds should we use?
  - A large number gives a good estimate of the error rate
  - However, the variance of the estimator is large and it is very computationally expensive
  - The opposite holds for a small number of folds
  - The best choice depends on the dataset and how sparse it is
  - Common choices are 5 or 10, but rigorous use should be defined by the sample
- Have been developing a tool to support the train -> validate -> test process
  - This is currently used on top of TMVA
  - The procedure of averaging many different validations of trainings gives improved performance over the best variant (with the current definition of how to take the average)
- Question: regarding integration within TMVA, the new interface should be able to do at least 3 or 4 of those steps
  - The GUI has also been updated; if not, it would be good to know what limitations you are encountering
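As a minimal illustration of the k-fold procedure described in Tom's talk, a generic scikit-learn sketch on toy data (the tool discussed above sits on top of TMVA instead; the data, model, and fold count here are illustrative assumptions):

```python
# Minimal k-fold cross-validation sketch (generic scikit-learn; the tool in the
# talk is built on top of TMVA). Each event is used for training in k-1 folds
# and for testing in exactly one, so the whole dataset serves both purposes.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 10))                                        # toy features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000) > 0).astype(int)  # toy labels

k = 5                                          # common choices are 5 or 10 folds
aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=k, shuffle=True,
                                           random_state=0).split(X, y):
    clf = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
    aucs.append(roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1]))

print(f"AUC = {np.mean(aucs):.3f} +/- {np.std(aucs):.3f} over {k} folds")
```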