IML meeting: December 1, 2017
Peak people on Vidyo: 36
Peak people in the room: 12

Paul: Intro and news
- IML workshop will be in April 2018
  - Core workshop April 9-11
  - Full-day hackathon will be April 12
  - This time, the challenge will be announced a month in advance
- Next IML monthly meeting is January 26, 2018
- Topic today: benchmark datasets
  - When reading image-processing papers, it is nice that many models are tested on the MNIST dataset
    - Can compare different models on the same benchmark without needing to install and run them
  - HEP problems are different from actual photographs
  - Would be good to have common benchmarks instead of one (non-public) benchmark dataset per experiment
- Today, technical topics:
  - Accessibility of data for the public (without lxplus login)
  - Accessibility for CERN users (access through XRootD, /eos)
  - Multiple file formats (can we have ROOT for physicists and pandas/csv/hdf/… for the ML community at the same time?)

David: Experience creating the Higgs Machine Learning Challenge dataset
- Released full Geant4 simulation
  - H->tautau signal with background mixture of Z, top, and W
  - 30 variables
  - Flat data structure
  - 800k events (250k training set, 550k test set)
- Important question: do you want to release the full dataset, or keep part of it private (to test overtraining)?
  - Decided to deliberately release only a subset during the challenge
  - Released the rest at the end
- Deliberately omitted correction factors and similar so that this dataset couldn't be used for physics
- Real analysis vs challenge: simple, but not too simple!
  - Always a balance: want the problem to be simple but still interesting and useful
- Two spin-offs from this dataset
  - Anomaly detection collaborative competition
    - Built from the HiggsML dataset
    - A skewed dataset was also built from it, introducing small and big distortions
    - Goal: separate the original dataset from the skewed dataset
    - This was also re-used for teaching, something useful to keep in mind
      - Best score significantly improved in class compared to the single-day competition
  - Systematics spin-off
    - Focusing on the impact of the TES (tau energy scale) on the classification
    - Build a benchmark for different systematics-aware training techniques
    - Had to use a different dataset in the end due to lack of statistics
      - The other dataset had other unfortunate limitations, but we needed the statistics
- Some thoughts
  - Important to simplify as much as possible so the data is usable almost immediately, without auxiliary software
  - Provide a simple figure of merit for a typical task (see the sketch below)
  - Foresee different tasks
    - Keep richness in the dataset for flexibility and applicability to other unforeseen tasks
  - Add as much statistics as possible to allow sophisticated methods to be tried (DNNs, etc)
- During the Kaggle HiggsML competition, had a very active forum with questions being answered by the users themselves
  - Would be nice to have such a forum associated with the CERN Open Data portal
  - Has to be monitored, but the ability for self-help from users is very beneficial
- Foreseen future datasets:
  - Tracking challenge to be run on Kaggle
    - HL-LHC simulation from ACTS (open-source spin-off from ATLAS tracking software)
    - Challenge will be on pattern recognition
  - ATLAS Open Data policy just updated; now we have a better mechanism to simplify the release of datasets
    - First implementation will probably be a Geant4 shower in the ATLAS calorimeter
    - Computer vision for classification and regression, GAN simulation, etc
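For concreteness, the figure of merit used in the HiggsML challenge was the approximate median significance (AMS); a minimal sketch in Python, with the regularization term b_reg = 10 as documented for the challenge (the yields in the example are hypothetical):

```python
import numpy as np

def ams(s, b, b_reg=10.0):
    """Approximate median significance, the HiggsML figure of merit.

    s: expected (weighted) signal yield in the selected region
    b: expected (weighted) background yield in the selected region
    b_reg: regularization term, set to 10 in the HiggsML challenge
    """
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log(1.0 + s / (b + b_reg)) - s))

# Hypothetical yields from a classifier's selection, for illustration only
print(ams(s=450.0, b=35000.0))
```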
- Question (Paul): Nobody is perfect; when releasing, people are worried you didn't simulate something correctly
  - Creates bad publicity
  - Did you encounter any fear or complaints in that direction?
  - David: Afraid of this; that's the reason we didn't try to have something accurate
    - Deliberately left the small backgrounds to the side, didn't try to reproduce the normalization exactly
    - Don't try to be perfect: be deliberately a bit wrong
    - The intentional problems don't have any impact on ML classification, but they make it clear this is not a physics study
  - Andre: It is in the interest of science to publish this; we have enough internal reviews
    - Should come out with results that are not 100% perfect, but maybe 99%
  - David: At that time there was no review mechanism in place for that
    - With the new policy that ATLAS just adopted, make it clear that we don't want to release data for reproducing analyses
    - The objective is really for people to develop new algorithms
- Question (Sunje): On the suggestion of a forum: we have been considering it for many years
  - It's a resource problem: if we have a forum, it needs a moderator
  - Have data releases from four LHC experiments, and new experiments are foreseen
    - Need people from each experiment
  - Just lacking resources; suggestions of how to do it would be appreciated
  - Sergei: Not sure it is an absolute blocker; most of the information is exchanged among participants
  - Sunje: We have done an AMA on Reddit for CERN Open Data, but we regularly have crackpots coming in
    - We need to make sure that doesn't happen; need to have some people who take care of it
  - Sergei: There is a difference: David's was in the context of a challenge, but you have more use cases than that
- Question (Sergei): Coming back to the final slide about statistics, the amount of data in the original set was insufficient
  - What is the statistics in the upcoming tracking challenge?
  - David: Thinking of 1 million events, corresponding to 10 billion tracks and 1 TB of space
    - Maybe we will just release the simulator so people can generate what they want

Lars: Zenodo
- Don't do digital preservation yourself!
  - Then the question is where it should be done
  - Zenodo? CERN Open Data? Will try to cover both between this talk and the next
- Zenodo is simple, self-service
- You get a DOI for anything you upload
  - You can put this in your reference list, which allows us to count citations correctly
  - Inspire, PubMed, etc use these
  - These are the primary keys for you; essential to have DOIs for citation analysis
  - The DOI can move around; it's persistent, globally unique, has metadata behind it, etc
  - If the data is lost in 50 years, there is still a record of what the dataset was, even if the data itself is gone
- Zenodo is mostly made for people *not* at CERN, so there is no CERN sign-on
  - Instead, it has GitHub and ORCID logins, as well as manual login
- Hit the upload button and add your dataset; the default quota is 50 GB/dataset (a sketch of the equivalent API flow follows below)
  - For CERN, can get a quota increase to TB sizes
- Can upload any format, multiple files; you are in control
- Any license can be added on top of your data
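Besides the web upload flow Lars describes, Zenodo also exposes a public REST API; a minimal sketch of the create-upload-publish cycle as given in the public Zenodo API documentation, where the access token, file name, and metadata are placeholders:

```python
import requests

TOKEN = "..."  # personal access token from your Zenodo account settings
API = "https://zenodo.org/api/deposit/depositions"

# Create an empty deposition
r = requests.post(API, params={"access_token": TOKEN}, json={})
r.raise_for_status()
deposition = r.json()

# Upload a file into the deposition's bucket (hypothetical file name)
bucket = deposition["links"]["bucket"]
with open("benchmark_subset.csv", "rb") as fp:
    requests.put(f"{bucket}/benchmark_subset.csv",
                 data=fp, params={"access_token": TOKEN}).raise_for_status()

# Attach minimal metadata, then publish to mint the DOI
metadata = {"metadata": {"title": "Example benchmark dataset",
                         "upload_type": "dataset",
                         "description": "Illustrative deposit only.",
                         "creators": [{"name": "Doe, Jane"}]}}
requests.put(f"{API}/{deposition['id']}",
             params={"access_token": TOKEN}, json=metadata).raise_for_status()
requests.post(deposition["links"]["publish"],
              params={"access_token": TOKEN}).raise_for_status()
```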
- Before uploading, think of a couple of things
  - What is a good file format?
  - Is it reusable?
  - What license should be used?
  - We cannot give you good advice; we just provide the service (CERN Open Data does provide help)
  - In either case, you should start early
- Once files are uploaded, you can share them
  - Not only data: also software, presentations, videos, etc
  - Lots of linking options: can link software to data, etc
- Hit the publish button; it's online immediately and you get a DOI
  - Cannot edit files once uploaded; that's the point of a DOI
  - Can update the metadata (title, description, etc)
- Support an embargo period if you want it (data released later)
- Support restricted data (get a secret link, or fully closed)
- Versioning
  - Cannot edit files, but can create a new version
  - This means you get a new DOI, as the underlying dataset has changed
  - Important to track exactly which dataset was used in a given paper etc
  - You can also cite the entire dataset (which lists all versions)
  - If you go to an old version, there is a notice saying that a newer version is available
- Communities can be made; this is all self-service
  - People can all upload to that community
  - You can define a curator who decides what can/cannot be added to the community
- Integration with GitHub
  - Example shown in the slides
  - However, you can delete GitHub repositories or similar
  - If you sign in with the GitHub account and flip a switch, software can be copied to Zenodo with a DOI
- Born out of an EU project for people who don't have CERN infrastructure
  - However, many things fit in this type of repository
  - So it is used by people from all over
- Everything is HTTP access; no XRootD access, despite everything being stored on EOS
- Question (Paul): You said it's up to me to choose a common format for my data, and there are a few common choices
  - Can I upload multiple formats at the same time?
    - Users can then pick the versions that they want
  - Lars: Yes, fully possible (see the conversion sketch below)
    - Don't put in too many files; the limit is 100 files right now
    - Users would have to click download on all of them
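As one way to serve both communities at once, a minimal sketch of deriving pandas-friendly copies from a ROOT file before upload; the file and tree names are hypothetical, and it assumes the uproot and pandas packages (to_hdf additionally needs PyTables):

```python
import uproot  # pure-Python ROOT I/O
import pandas as pd

# Hypothetical input: a flat ntuple called "events" in dataset.root
tree = uproot.open("dataset.root")["events"]
df = tree.arrays(library="pd")  # read all branches into a DataFrame

# Write the alternative formats alongside the original ROOT file
df.to_csv("dataset.csv", index=False)
df.to_hdf("dataset.h5", key="events", mode="w")
```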
- Question (Paul): The metadata is editable; does that include licensing?
  - I could then switch between different licenses?
  - Lars: Yes, you can switch it around as much as you like
    - It's a mess; you're in charge of making sure that doesn't happen
    - It's really self-service
- Question (Sergei): Size limitations?
  - Lars: 50 GB/dataset; request a quota increase from us if you want something bigger
    - 1 TB is the largest we've done so far
  - Technically everything is on EOS, streams directly there
    - That's ultimately the limit: whatever EOS allows
  - Accessed by HTTP, so sizes need to be reasonable
    - If it's PB-sized, then HTTP is not the way to go

Sunje: CERN Open Data
- Work closely with Lars, who just presented
- Have quite a big release in 10 days (PB-scale size); TB size is no problem at all
- Not a self-service: releases go through her team, which helps to curate/prepare datasets for the public
- Our audience is to some extent unknown
  - We do know about users from the education side; have been collaborating with teachers
    - Communicating to prepare more educational exercises along with the data
  - First examples of Jesse Thaler from MIT using the data for publications
- Published 300 TB of CMS data in April 2016
  - 210k distinct users visited the site, 66k users played with the event display, etc
  - Huge interest
- CERN analysis data
  - Huge scale
  - REANA, CERN Open Data
- Processed data
  - Also work closely with arXiv, Inspire, and CDS
- Final result data
- Demo of the new CERN Open Data portal, to be live 10 days from now
  - Provide filters and facets for searching through the data
  - Can adjust the data model according to needs, which would change how we present the dataset
  - Also comes with a DOI
    - Can version it under the global DOI
  - Key difference is we offer different data models depending on what you need
  - Often have a part on data validation and some more documentation
  - Provide tools to visualize the data
- Providing answers to several questions we asked in advance
  - Is it possible to tag datasets?
    - Yes, can search for example 7TeV, PbPb, MC, etc
  - Ways of accessing the data
    - On EOS or by download
  - DOIs are provided for content they host, but not for external datasets
    - They don't track external datasets, so they can't guarantee immutability
    - Once a DOI is created, it's frozen, but it can have related datasets (isParentOf, isSupersetOf, ...)
  - Dataset size and formats
    - 1 TB is not a problem; the underlying storage is EOS
    - Format has no limitations, but keep in mind how it will be reused
    - Most datasets so far are in the ROOT format
- CERN Open Data in a nutshell
  - Particle-physics oriented, customized data models
  - Tailored facets (energy, particles, ...)
  - LHC data so far, but also OPERA, perhaps LEP
  - No self-depositing; data prepared with expert curation
    - If you attend a conference the next day, it may not be public in time
    - Needs to be done with some thought in advance
    - Optimized for big releases
  - Only open content: no user accounts, no access controls
  - Both HTTP and XRootD streaming (direct EOSPUBLIC access; see the sketch after this section)
  - The tailored data model that we have may be beneficial for you
    - Zenodo's metadata model can be a bit restrictive
  - Zenodo and CERN Open Data are all part of the same thing
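To illustrate the XRootD streaming access mentioned above, a minimal PyROOT sketch; the file path and tree name are hypothetical, and only the eospublic.cern.ch endpoint is taken from the talk:

```python
import ROOT

# Open a CERN Open Data file directly over XRootD, without downloading it
# (hypothetical path under the EOSPUBLIC namespace)
f = ROOT.TFile.Open(
    "root://eospublic.cern.ch//eos/opendata/cms/some/dataset/file.root")
tree = f.Get("Events")  # the tree name depends on the dataset
print(tree.GetEntries())
```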
- Question (Michele): Any limitation if someone comes up with a CDF dataset; can it go on CERN Open Data?
  - Sunje: Not a problem at all; recently had discussions on this
    - Already working on expanding it to non-LHC experiments
    - Working with DESY to expand it to non-CERN labs
    - Discussing with everyone; we don't all need to repeat the effort and set up the same thing
- Question (LongGang Pang): Are simulations from theory/phenomenology possible; can they be served in this open data or not?
  - Sunje: Without knowing the details, I wouldn't say no
    - There is also a discussion about HEPData and Rivet, due to large demand from institutions/groups to share more
    - Quite a lot of discussion on who publishes what
    - MC can be quite big, so Rivet and HEPData often have problems; maybe it will be us
    - The ultimate best location may need more time
- Question (Steven): Is it possible to save software, such as the MC generation software, with the data?
  - This would let people generate more data if they need it, and so on
  - Sunje: Yes, we support software too, but we don't have automated GitHub integration
    - Could be doable, as we have the same software backend as Zenodo
  - Lars: The underlying module is installable on Open Data, so you can get it from GitHub
    - Would also be interesting to have GitLab, from the CERN side
    - Working on that integration now
- Comment (Sergei): We do intend to create additional datasets
  - However, there are already some we can use to test out the pipeline
  - The UCI repository, for example, has been around for a long time
  - We will hear from OpenML soon as another example
  - Some of the most used/referenced datasets are the ones connected to computer vision
    - Not aiming to have one of those, but within our field we want something similar

Joaquin: OpenML
- Open-source project by a group of people who wanted to make ML really easy to use
- Started when I learned about the Sloan Digital Sky Survey
  - Collecting all of the data about the universe into one survey
  - Very useful, as people could answer lots of different questions using all of that data
  - Used ML to look for signatures of black holes and found new ones as a result
  - Inspired us to do something similar for ML
- Browse different datasets, see what works or not, what you can reproduce
- Want to make ML frictionless, open, accessible, collaborative, and automated
- Have APIs in Python, Java, R, etc
- Can easily browse all of the datasets, about 10k right now
  - All automatically organized, annotated, etc
  - Can search for different properties (data layer)
  - Task layer allows you to define what you want to do with that data
  - Experiment layer lets you try different algorithms and see how well they perform
  - Everything is then shared reproducibly
- Can easily upload data: use the web form with title + description, licenses, citation requests, etc
  - Can do the same thing through the APIs (example: use a local Python script to upload your data to the server)
  - Each dataset is automatically indexed and auto-versioned
- Lots of different domains: biology, satellite data, etc
  - Click on any of them and see that every dataset has its own webpage with info
  - Also a wiki where you can add new information
  - Automatically analyze the feature distributions
- Can upload the data to their server; can also link an existing dataset to OpenML
  - Can use a dataset from Zenodo, etc
  - Registered via URL, transparent to users; the API allows integrations (auto-sharing, DOIs, etc)
  - Don't issue DOIs ourselves, but can talk via the API to others who do
  - Example of a dataset from Zenodo: https://www.openml.org/d/40976 (see the sketch below)
  - Can mirror datasets if necessary
- Data formats
  - Internally, everything is stored in ARFF, a very popular ML format
  - The APIs upload data arrays from R, Python
  - Working on additional auto-conversions for common formats
  - Export to different formats is possible through the APIs, including ROOT
- Datasets are auto-versioned as long as you upload with the same name
  - Extensions, corrections, different structures, etc are thus supported
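A minimal sketch of pulling that Zenodo-linked dataset (OpenML ID 40976, from the URL above) with the openml-python package, assuming it is installed and the server is reachable:

```python
import openml

# Fetch the Zenodo-linked example dataset by its OpenML ID
dataset = openml.datasets.get_dataset(40976)
print(dataset.name, dataset.default_target_attribute)

# Materialize the features and the target column locally
X, y, categorical, names = dataset.get_data(
    target=dataset.default_target_attribute)
print(X.shape, len(y))
```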
- Graphs show many different people working on a given dataset
  - Can see what people are doing, what works and what doesn't, etc
- Which algorithms can be run?
  - Offer integrations with the most used tools: R, scikit-learn, XGBoost, etc
  - Also offer custom integrations with the APIs
  - Graphical plugin for some, libraries for most
  - Example given in the slides of a Python scikit-learn script interacting with OpenML
- Every run is completely reproducible
  - Data, flows, authors, etc are all stored (auto-updated, evaluated online)
  - Every experiment also gets its own webpage
  - Can download the model, predictions, etc
- Lots of user pages, like on GitHub, so you can see who uses which algorithms, how active they are, etc
  - Also tracks your contributions (uploads, when people copy your code, etc)
- Close to 4000 registered people now, 2500 regular users
  - Completely open; you don't need to register
    - Most people register to collaborate on something or to register experiments
    - Actual number of users is much larger
  - 1/3 academic, 1/3 industry, 1/3 students
  - Used in many different places and contexts
- Working on
  - Sharing, to do studies with certain people, etc
  - Studies, which are a counterpart of a paper
  - Integrations with Jupyter, GitHub, etc
  - Learning to learn (bots that learn from prior experiments to help people build models)
- Comment (Paul): Thanks for the part on learning to learn; that's where these platforms become really interesting
  - Maybe I don't want the best classifier in the world, just a good working one by tomorrow
  - If the platform tells me a good place to start, that's very practical
  - Joaquin: Thinking of creating something where you can give it a dataset and it will spend an hour trying to figure out the best it can do
- Question (Sergei): All of these other models that you can run on the data: where are they run?
  - Joaquin: By default it runs client side; you locally train your model
    - In the end you publish your results to OpenML and we analyse them
    - This gives you a lot of flexibility, and you can choose where to run
    - Some run on clusters, some on laptops, etc (see the sketch below)
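A minimal sketch of such a client-side run with openml-python and scikit-learn; task ID 31 is just an illustrative public task, and publishing assumes an API key is configured:

```python
import openml
from sklearn.ensemble import RandomForestClassifier

# Fetch a task (dataset plus a predefined evaluation procedure)
task = openml.tasks.get_task(31)  # illustrative task ID

# Train locally: the model runs on your own machine (laptop, cluster, ...)
clf = RandomForestClassifier(n_estimators=100)
run = openml.runs.run_model_on_task(clf, task)

# Publish the predictions; the server then evaluates them consistently
run.publish()
print(run)
```

The same few lines could run inside the Docker image Joaquin mentions next: the bot just fetches a task, trains locally, and uploads the results.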
- Question (Sergei): Sometimes it makes sense to keep the data where it is
  - Should we think of having a local setup of OpenML, so we can ask on demand for this model, that model, etc?
  - If the data itself is here, maybe it would make sense
  - Do you know how much work that would be?
  - Joaquin: You want to build models at CERN?
  - Sergei: Yes
  - Joaquin: The easiest way is to have a docker image
    - A script runs the experiments in the docker image
    - The bot just downloads data, runs experiments, and uploads results
    - Could easily do this on CERN hardware as well
- Question (Sunje): First, nice website; I was checking it out while you were speaking
  - Five minutes ago, I saw that many of the datasets you have come from UCI, which Sergei was just talking about
  - Do you know what fraction of your data comes from external sources (UCI, Zenodo, etc)?
  - Joaquin: A lot of datasets are uploaded by users; some have DOIs, many do not
    - Maybe 80% of UCI datasets are also on OpenML
    - Some datasets are difficult to work with and some people don't use them, so they are not on OpenML
    - Some open datasets from Kaggle, a few from Zenodo, many from projects (microbiology institutes, etc)
- Question (Sunje): Looking at your data citation recommendations, they are based on UCI and similar
  - Have you looked into more DOI-based data citation?
    - Of course this relates to UCI not having DOIs
  - Joaquin: Something we want to improve on; right now it's quite free-form
    - We don't consistently require people to have a DOI or citation
- Question (Sergei): Question on metrics; you show figures of merit for the same dataset/models
  - Do you mostly support a single metric, multiple, or what?
  - How do you ensure that the implementations of the metrics are the same, so it is done consistently?
  - Joaquin: That's why we validate the metrics on the server
    - If you use scikit-learn, it will give you a result, but the ones on the plots are all evaluated on the server
    - Support 30-35 metrics, all computed on all experiments
  - Paul: Could I define my own metric, if I have something crazy that makes sense from a physics perspective?
  - Joaquin: You can compute it with the run and upload it
    - If you want to do it for many experiments, you can add a new evaluation metric to OpenML
  - Sergei: Does the figure-of-merit calculation come with an uncertainty value?
  - Joaquin: Typically 10-fold cross-validation, and then you have an uncertainty from the 10 results (see the sketch below)
    - Depending on the task, can have the probabilities per class
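A minimal illustration of the cross-validated uncertainty Joaquin describes, using scikit-learn on a bundled toy dataset; the dataset and model are placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 10-fold cross-validation: ten scores give a spread, not just a point value
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(f"accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")
```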
General discussion:
- Sergei: The tracking challenge dataset: where is that planned to be stored?
  - David: It has to be on Kaggle; when the challenge is done, we can do what we want with it
    - Intend to upload it to the CERN Open Data portal
    - Right now it's 1 TB, so we have to see
- Sunje: For me it is not entirely clear what you are aiming for
  - Very impressed with the OpenML presentation
  - For the Kaggle challenge, it's obvious I would consider what we are doing: putting it on CERN Open Data or similar
  - If the main interaction happens on OpenML afterwards, then Zenodo would be a good place to start
  - Not clear which way you want to go
  - One of the use cases: you want a DOI that is versioned
    - Either of the places can help you get the DOI; Inspire or others can track the DOI
  - If you want more of the interactive stuff, Kaggle or OpenML seem to cover that
- Sergei: Lots of good things can be done with open data
  - The idea here is to have something not specific to any experiment, but rather generic datasets to be used as benchmarks
    - Want to still be relevant to particle physics experiments
  - Idea is to have these benchmark datasets to see how techniques work
    - Different variations: open it up to the outside and have a Kaggle competition
    - The other is within the community: track progress on figures of merit relevant to us
    - Some areas where we want to encourage more activity
  - The whole point for us is to put together these datasets, maintain them, and promote them
- Michele: I think we can go in an incremental way
  - Primary goal is to provide repositories which are easy to find, unique, and provide a DOI
    - Need to think and take a decision; Zenodo and CERN Open Data both seem good choices
  - Second level: a set of interactive tools associated with the dataset
    - Can be used as a starting point for people in the future, such as OpenML
- Lars: Same comment I would have made
  - Compare the features of Zenodo and CERN Open Data
  - Use that archive as a back-end
  - Then work with something like OpenML for community interactions
- Lars: One last comment on something we haven't discussed so much: the outreach part
  - Zenodo is more of an everything-self-serve service
  - CERN Open Data really should gather everything open from CERN
- Sunje: Might want to consider community-specific features
  - For example, is XRootD relevant for the community?
- Joaquin: The combination of OpenML and Zenodo is a very good one
  - We have APIs to link datasets to ML or data science tools in general
  - Would be very interested in exploring how we can integrate them closely
  - We could even tell people to put their datasets on Zenodo
- Michele: If anyone has additional thoughts on what service seems best or otherwise, send the IML coordinators an email
  - iml.coordinators@cern.ch
  - We intend to work towards a community repository very soon