IML meeting: December 1, 2017
Peak people on Vidyo: 36
Peak people in the room: 12

Paul: Intro and news
- IML workshop will be in April 2018
  - Core workshop April 9-11
  - Full-day hackathon will be April 12
  - This time, the challenge will be announced a month in advance
- Next IML monthly meeting is January 26, 2018
- Topic today: benchmark datasets
  - When reading image-processing papers, it is nice that many models are tested on the MNIST dataset
    - Can compare different models on the same benchmark without needing to install and run them
  - HEP problems are different from actual photographs
  - Would be good to have common benchmarks instead of one (non-public) benchmark dataset per experiment
- Today, technical topics:
  - Accessibility of data for the public (without lxplus login)
  - Accessibility for CERN users (access through XRootD, /eos)
  - Multiple file formats (can we have ROOT for physicists and pandas/csv/hdf/… for the ML community at the same time?)

David: Experience creating the Higgs Machine Learning Challenge dataset
- Released full Geant4 simulation
  - H->tautau signal with background mixture of Z, top, and W
  - 30 variables
  - Flat data structure
  - 800k events (250k training set, 550k test set)
- Important question: do you want to release the full dataset, or keep part of it private (to test overtraining)?
  - Decided to deliberately release only a subset during the challenge
  - Released the rest at the end
- Deliberately omitted correction factors and similar so that this dataset couldn't be used for physics
- Real analysis vs challenge: simple, but not too simple!
  - Always a balance: want the problem to be simple but still interesting and useful
- Two spin-offs from this dataset
  - Anomaly detection collaborative competition
    - Built from the HiggsML dataset
    - A skewed dataset was also built from it, introducing small and big distortions
    - Goal: separate the original dataset from the skewed dataset
    - This was also re-used for teaching, something useful to keep in mind
      - Best score significantly improved in class compared to the single-day competition
  - Systematics spin-off
    - Focusing on the impact of the TES (tau energy scale) on the classification
    - Build a benchmark for different systematics-aware training techniques
    - Had to use a different dataset in the end due to lack of statistics
      - The other dataset had other unfortunate limitations, but we needed the statistics
- Some thoughts
  - Important to simplify as much as possible so the data is usable almost immediately, without auxiliary software
  - Provide a simple figure of merit for a typical task (see the sketch below)
  - Foresee different tasks
    - Keep richness in the dataset for flexibility and applicability to other unforeseen tasks
  - Add as much statistics as possible to allow sophisticated methods to be tried (DNNs, etc)
- During the Kaggle HiggsML competition, had a very active forum with questions being answered by the users themselves
  - Would be nice to have such a forum associated with the CERN Open Data portal
  - Has to be monitored, but the ability for self-help from users is very beneficial
- Foreseen future datasets:
  - Tracking challenge to be run on Kaggle
    - HL-LHC simulation from ACTS (open-source spin-off from ATLAS tracking software)
    - Challenge will be on pattern recognition
  - ATLAS Open Data policy just updated; now we have a better mechanism to simplify the release of datasets
    - First implementation will probably be a Geant4 shower in the ATLAS calorimeter
    - Computer vision for classification and regression, GAN simulation, etc
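For concreteness, the figure of merit used in the HiggsML challenge was the approximate median significance (AMS); a minimal sketch in Python, with the regularization term b_reg = 10 as documented for the challenge (the yields in the example are hypothetical):

```python
import numpy as np

def ams(s, b, b_reg=10.0):
    """Approximate median significance, the HiggsML figure of merit.

    s: expected (weighted) signal yield in the selected region
    b: expected (weighted) background yield in the selected region
    b_reg: regularization term, set to 10 in the HiggsML challenge
    """
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log(1.0 + s / (b + b_reg)) - s))

# Hypothetical yields from a classifier's selection, for illustration only
print(ams(s=450.0, b=35000.0))
```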
- Question (Paul): Nobody is perfect; when releasing, people are worried you didn't simulate something correctly
  - Creates bad publicity
  - Did you encounter any fear or complaints in that direction?
  - David: Afraid of this; that's the reason we didn't try to have something accurate
    - Deliberately left the small backgrounds to the side, didn't try to reproduce the normalization exactly
    - Don't try to be perfect: be deliberately a bit wrong
    - The intentional problems don't have any impact on ML classification, but they make it clear this is not a physics study
  - Andre: It is in the interest of science to publish this; we have enough internal reviews
    - Should come out with results that are not 100% perfect, but maybe 99%
  - David: At that time there was no review mechanism in place for that
    - With the new policy that ATLAS just adopted, make it clear that we don't want to release data for reproducing analyses
    - The objective is really for people to develop new algorithms
- Question (Sunje): On the suggestion of a forum: we have been considering it for many years
  - It's a resource problem: if we have a forum, it needs a moderator
  - Have data releases from four LHC experiments, and new experiments are foreseen
    - Need people from each experiment
  - Just lacking resources; suggestions of how to do it would be appreciated
  - Sergei: Not sure it is an absolute blocker; most of the information is exchanged among participants
  - Sunje: We have done an AMA on Reddit for CERN Open Data, but we regularly have crackpots coming in
    - We need to make sure that doesn't happen; need to have some people who take care of it
  - Sergei: There is a difference: David's was in the context of a challenge, but you have more use cases than that
- Question (Sergei): Coming back to the final slide about statistics, the amount of data in the original set was insufficient
  - What is the statistics in the upcoming tracking challenge?
  - David: Thinking of 1 million events, corresponding to 10 billion tracks and 1 TB of space
    - Maybe we will just release the simulator so people can generate what they want

Lars: Zenodo
- Don't do digital preservation yourself!
  - Then the question is where it should be done
  - Zenodo? CERN Open Data? Will try to cover both between this talk and the next
- Zenodo is simple, self-service
- You get a DOI for anything you upload
  - You can put this in your reference list, which allows us to count citations correctly
  - Inspire, PubMed, etc use these
  - These are the primary keys for you; essential to have DOIs for citation analysis
  - The DOI can move around; it's persistent, globally unique, has metadata behind it, etc
  - If the data is lost in 50 years, there is still a record of what the dataset was, even if the data itself is gone
- Zenodo is mostly made for people *not* at CERN, so there is no CERN sign-on
  - Instead, it has GitHub and ORCID logins, as well as manual login
- Hit the upload button and add your dataset; the default quota is 50 GB/dataset (a sketch of the equivalent API flow follows below)
  - For CERN, can get a quota increase to TB sizes
- Can upload any format, multiple files; you are in control
- Any license can be added on top of your data
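Besides the web upload flow Lars describes, Zenodo also exposes a public REST API; a minimal sketch of the create-upload-publish cycle as given in the public Zenodo API documentation, where the access token, file name, and metadata are placeholders:

```python
import requests

TOKEN = "..."  # personal access token from your Zenodo account settings
API = "https://zenodo.org/api/deposit/depositions"

# Create an empty deposition
r = requests.post(API, params={"access_token": TOKEN}, json={})
r.raise_for_status()
deposition = r.json()

# Upload a file into the deposition's bucket (hypothetical file name)
bucket = deposition["links"]["bucket"]
with open("benchmark_subset.csv", "rb") as fp:
    requests.put(f"{bucket}/benchmark_subset.csv",
                 data=fp, params={"access_token": TOKEN}).raise_for_status()

# Attach minimal metadata, then publish to mint the DOI
metadata = {"metadata": {"title": "Example benchmark dataset",
                         "upload_type": "dataset",
                         "description": "Illustrative deposit only.",
                         "creators": [{"name": "Doe, Jane"}]}}
requests.put(f"{API}/{deposition['id']}",
             params={"access_token": TOKEN}, json=metadata).raise_for_status()
requests.post(deposition["links"]["publish"],
              params={"access_token": TOKEN}).raise_for_status()
```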
- Before uploading, think of a couple of things
  - What is a good file format?
  - Is it reusable?
  - What license should be used?
  - We cannot give you good advice; we just provide the service (CERN Open Data does provide help)
  - In either case, you should start early
- Once files are uploaded, you can share them
  - Not only data: also software, presentations, videos, etc
  - Lots of linking options: can link software to data, etc
- Hit the publish button; it's online immediately and you get a DOI
  - Cannot edit files once uploaded; that's the point of a DOI
  - Can update the metadata (title, description, etc)
- Support an embargo period if you want it (data released later)
- Support restricted data (get a secret link, or fully closed)
- Versioning
  - Cannot edit files, but can create a new version
  - This means you get a new DOI, as the underlying dataset has changed
  - Important to track exactly which dataset was used in a given paper etc
  - You can also cite the entire dataset (which lists all versions)
  - If you go to an old version, there is a notice saying that a newer version is available
- Communities can be made; this is all self-service
  - People can all upload to that community
  - You can define a curator who decides what can/cannot be added to the community
- Integration with GitHub
  - Example shown in the slides
  - However, you can delete GitHub repositories or similar
  - If you sign in with the GitHub account and flip a switch, software can be copied to Zenodo with a DOI
- Born out of an EU project for people who don't have CERN infrastructure
  - However, many things fit in this type of repository
  - So it is used by people from all over
- Everything is HTTP access; no XRootD access, despite everything being stored on EOS
- Question (Paul): You said it's up to me to choose a common format for my data, and there are a few common choices
  - Can I upload multiple formats at the same time?
    - Users can then pick the versions that they want
  - Lars: Yes, fully possible (see the conversion sketch below)
    - Don't put in too many files; the limit is 100 files right now
    - Users would have to click download on all of them
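As one way to serve both communities at once, a minimal sketch of deriving pandas-friendly copies from a ROOT file before upload; the file and tree names are hypothetical, and it assumes the uproot and pandas packages (to_hdf additionally needs PyTables):

```python
import uproot  # pure-Python ROOT I/O
import pandas as pd

# Hypothetical input: a flat ntuple called "events" in dataset.root
tree = uproot.open("dataset.root")["events"]
df = tree.arrays(library="pd")  # read all branches into a DataFrame

# Write the alternative formats alongside the original ROOT file
df.to_csv("dataset.csv", index=False)
df.to_hdf("dataset.h5", key="events", mode="w")
```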
- Question (Paul): The metadata is editable; does that include licensing?
  - I could then switch between different licenses?
  - Lars: Yes, you can switch it around as much as you like
    - It's a mess; you're in charge of making sure that doesn't happen
    - It's really self-service
- Question (Sergei): Size limitations?
  - Lars: 50 GB/dataset; request a quota increase from us if you want something bigger
    - 1 TB is the largest we've done so far
  - Technically everything is on EOS, streams directly there
    - That's ultimately the limit: whatever EOS allows
  - Accessed by HTTP, so sizes need to be reasonable
    - If it's PB-sized, then HTTP is not the way to go

Sunje: CERN Open Data
- Work closely with Lars, who just presented
- Have quite a big release in 10 days (PB-scale size); TB size is no problem at all
- Not a self-service: releases go through her team, which helps to curate/prepare datasets for the public
- Our audience is to some extent unknown
  - We do know about users from the education side; have been collaborating with teachers
    - Communicating to prepare more educational exercises along with the data
  - First examples of Jesse Thaler from MIT using the data for publications
- Published 300 TB of CMS data in April 2016
  - 210k distinct users visited the site, 66k users played with the event display, etc
  - Huge interest
- CERN analysis data
  - Huge scale
  - REANA, CERN Open Data
- Processed data
  - Also work closely with arXiv, Inspire, and CDS
- Final result data
- Demo of the new CERN Open Data portal, to be live 10 days from now
  - Provide filters and facets for searching through the data
  - Can adjust the data model according to needs, which would change how we present the dataset
  - Also comes with a DOI
    - Can version it under the global DOI
  - Key difference is we offer different data models depending on what you need
  - Often have a part on data validation and some more documentation
  - Provide tools to visualize the data
- Providing answers to several questions we asked in advance
  - Is it possible to tag datasets?
    - Yes, can search for example 7TeV, PbPb, MC, etc
  - Ways of accessing the data
    - On EOS or by download
  - DOIs are provided for content they host, but not for external datasets
    - They don't track external datasets, so they can't guarantee immutability
    - Once a DOI is created, it's frozen, but it can have related datasets (isParentOf, isSupersetOf, ...)
  - Dataset size and formats
    - 1 TB is not a problem; the underlying storage is EOS
    - Format has no limitations, but keep in mind how it will be reused
    - Most datasets so far are in the ROOT format
- CERN Open Data in a nutshell
  - Particle-physics oriented, customized data models
  - Tailored facets (energy, particles, ...)
  - LHC data so far, but also OPERA, perhaps LEP
  - No self-depositing; data prepared with expert curation
    - If you attend a conference the next day, it may not be public in time
    - Needs to be done with some thought in advance
    - Optimized for big releases
  - Only open content: no user accounts, no access controls
  - Both HTTP and XRootD streaming (direct EOSPUBLIC access; see the sketch after this section)
  - The tailored data model that we have may be beneficial for you
    - Zenodo's metadata model can be a bit restrictive
  - Zenodo and CERN Open Data are all part of the same thing
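To illustrate the XRootD streaming access mentioned above, a minimal PyROOT sketch; the file path and tree name are hypothetical, and only the eospublic.cern.ch endpoint is taken from the talk:

```python
import ROOT

# Open a CERN Open Data file directly over XRootD, without downloading it
# (hypothetical path under the EOSPUBLIC namespace)
f = ROOT.TFile.Open(
    "root://eospublic.cern.ch//eos/opendata/cms/some/dataset/file.root")
tree = f.Get("Events")  # the tree name depends on the dataset
print(tree.GetEntries())
```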
- Question (Michele): Any limitation if someone comes up with a CDF dataset; can it go on CERN Open Data?
  - Sunje: Not a problem at all; recently had discussions on this
    - Already working on expanding it to non-LHC experiments
    - Working with DESY to expand it to non-CERN labs
    - Discussing with everyone; we don't all need to repeat the effort and set up the same thing
- Question (LongGang Pang): Are simulations from theory/phenomenology possible; can they be served in this open data or not?
  - Sunje: Without knowing the details, I wouldn't say no
    - There is also a discussion about HEPData and Rivet, due to large demand from institutions/groups to share more
    - Quite a lot of discussion on who publishes what
    - MC can be quite big, so Rivet and HEPData often have problems; maybe it will be us
    - The ultimate best location may need more time
- Question (Steven): Is it possible to save software, such as the MC generation software, with the data?
  - This would let people generate more data if they need it, and so on
  - Sunje: Yes, we support software too, but we don't have automated GitHub integration
    - Could be doable, as we have the same software backend as Zenodo
  - Lars: The underlying module is installable on Open Data, so you can get it from GitHub
    - Would also be interesting to have GitLab, from the CERN side
    - Working on that integration now
- Comment (Sergei): We do intend to create additional datasets
  - However, there are already some we can use to test out the pipeline
  - The UCI repository, for example, has been around for a long time
  - We will hear from OpenML soon as another example
  - Some of the most used/referenced datasets are the ones connected to computer vision
    - Not aiming to have one of those, but within our field we want something similar

Joaquin: OpenML
- Open-source project by a group of people who wanted to make ML really easy to use
- Started when I learned about the Sloan Digital Sky Survey
  - Collecting all of the data about the universe into one survey
  - Very useful, as people could answer lots of different questions using all of that data
  - Used ML to look for signatures of black holes and found new ones as a result
  - Inspired us to do something similar for ML
- Browse different datasets, see what works or not, what you can reproduce
- Want to make ML frictionless, open, accessible, collaborative, and automated
- Have APIs in Python, Java, R, etc
- Can easily browse all of the datasets, about 10k right now
  - All automatically organized, annotated, etc
  - Can search for different properties (data layer)
  - Task layer allows you to define what you want to do with that data
  - Experiment layer lets you try different algorithms and see how well they perform
  - Everything is then shared reproducibly
- Can easily upload data: use the web form with title + description, licenses, citation requests, etc
  - Can do the same thing through the APIs (example: use a local Python script to upload your data to the server)
  - Each dataset is automatically indexed and auto-versioned
- Lots of different domains: biology, satellite data, etc
  - Click on any of them and see that every dataset has its own webpage with info
  - Also a wiki where you can add new information
  - Automatically analyze the feature distributions
- Can upload the data to their server; can also link an existing dataset to OpenML
  - Can use a dataset from Zenodo, etc
  - Registered via URL, transparent to users; the API allows integrations (auto-sharing, DOIs, etc)
  - Don't issue DOIs ourselves, but can talk via the API to others who do
  - Example of a dataset from Zenodo: https://www.openml.org/d/40976 (see the sketch below)
  - Can mirror datasets if necessary
- Data formats
  - Internally, everything is stored in ARFF, a very popular ML format
  - The APIs upload data arrays from R, Python
  - Working on additional auto-conversions for common formats
  - Export to different formats is possible through the APIs, including ROOT
- Datasets are auto-versioned as long as you upload with the same name
  - Extensions, corrections, different structures, etc are thus supported
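A minimal sketch of pulling that Zenodo-linked dataset (OpenML ID 40976, from the URL above) with the openml-python package, assuming it is installed and the server is reachable:

```python
import openml

# Fetch the Zenodo-linked example dataset by its OpenML ID
dataset = openml.datasets.get_dataset(40976)
print(dataset.name, dataset.default_target_attribute)

# Materialize the features and the target column locally
X, y, categorical, names = dataset.get_data(
    target=dataset.default_target_attribute)
print(X.shape, len(y))
```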
- Graphs show many different people working on a given dataset
  - Can see what people are doing, what works and what doesn't, etc
- Which algorithms can be run?
  - Offer integrations with the most used tools: R, scikit-learn, XGBoost, etc
  - Also offer custom integrations with the APIs
  - Graphical plugin for some, libraries for most
  - Example given in the slides of a Python scikit-learn script interacting with OpenML
- Every run is completely reproducible
  - Data, flows, authors, etc are all stored (auto-updated, evaluated online)
  - Every experiment also gets its own webpage
  - Can download the model, predictions, etc
- Lots of user pages, like on GitHub, so you can see who uses which algorithms, how active they are, etc
  - Also tracks your contributions (uploads, when people copy your code, etc)
- Close to 4000 registered people now, 2500 regular users
  - Completely open; you don't need to register
    - Most people register to collaborate on something or to register experiments
    - Actual number of users is much larger
  - 1/3 academic, 1/3 industry, 1/3 students
  - Used in many different places and contexts
- Working on
  - Sharing, to do studies with certain people, etc
  - Studies, which are a counterpart of a paper
  - Integrations with Jupyter, GitHub, etc
  - Learning to learn (bots that learn from prior experiments to help people build models)
- Comment (Paul): Thanks for the part on learning to learn; that's where these platforms become really interesting
  - Maybe I don't want the best classifier in the world, just a good working one by tomorrow
  - If the platform tells me a good place to start, that's very practical
  - Joaquin: Thinking of creating something where you can give it a dataset and it will spend an hour trying to figure out the best it can do
- Question (Sergei): All of these other models that you can run on the data: where are they run?
  - Joaquin: By default it runs client side; you locally train your model
    - In the end you publish your results to OpenML and we analyse them
    - This gives you a lot of flexibility, and you can choose where to run
    - Some run on clusters, some on laptops, etc (see the sketch below)
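A minimal sketch of such a client-side run with openml-python and scikit-learn; task ID 31 is just an illustrative public task, and publishing assumes an API key is configured:

```python
import openml
from sklearn.ensemble import RandomForestClassifier

# Fetch a task (dataset plus a predefined evaluation procedure)
task = openml.tasks.get_task(31)  # illustrative task ID

# Train locally: the model runs on your own machine (laptop, cluster, ...)
clf = RandomForestClassifier(n_estimators=100)
run = openml.runs.run_model_on_task(clf, task)

# Publish the predictions; the server then evaluates them consistently
run.publish()
print(run)
```

The same few lines could run inside the Docker image Joaquin mentions next: the bot just fetches a task, trains locally, and uploads the results.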
- Question (Sergei): Sometimes it makes sense to keep the data where it is
  - Should we think of having a local setup of OpenML, so we can ask on demand for this model, that model, etc?
  - If the data itself is here, maybe it would make sense
  - Do you know how much work that would be?
  - Joaquin: You want to build models at CERN?
  - Sergei: Yes
  - Joaquin: The easiest way is to have a docker image
    - A script runs the experiments in the docker image
    - The bot just downloads data, runs experiments, and uploads results
    - Could easily do this on CERN hardware as well
- Question (Sunje): First, nice website; I was checking it out while you were speaking
  - Five minutes ago, I saw that many of the datasets you have come from UCI, which Sergei was just talking about
  - Do you know what fraction of your data comes from external sources (UCI, Zenodo, etc)?
  - Joaquin: A lot of datasets are uploaded by users; some have DOIs, many do not
    - Maybe 80% of UCI datasets are also on OpenML
    - Some datasets are difficult to work with and some people don't use them, so they are not on OpenML
    - Some open datasets from Kaggle, a few from Zenodo, many from projects (microbiology institutes, etc)
- Question (Sunje): Looking at your data citation recommendations, they are based on UCI and similar
  - Have you looked into more DOI-based data citation?
    - Of course this relates to UCI not having DOIs
  - Joaquin: Something we want to improve on; right now it's quite free-form
    - We don't consistently require people to have a DOI or citation
- Question (Sergei): Question on metrics; you show figures of merit for the same dataset/models
  - Do you mostly support a single metric, multiple, or what?
  - How do you ensure that the implementations of the metrics are the same, so it is done consistently?
  - Joaquin: That's why we validate the metrics on the server
    - If you use scikit-learn, it will give you a result, but the ones on the plots are all evaluated on the server
    - Support 30-35 metrics, all computed on all experiments
  - Paul: Could I define my own metric, if I have something crazy that makes sense from a physics perspective?
  - Joaquin: You can compute it with the run and upload it
    - If you want to do it for many experiments, you can add a new evaluation metric to OpenML
  - Sergei: Does the figure-of-merit calculation come with an uncertainty value?
  - Joaquin: Typically 10-fold cross-validation, and then you have an uncertainty from the 10 results (see the sketch below)
    - Depending on the task, can have the probabilities per class
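A minimal illustration of the cross-validated uncertainty Joaquin describes, using scikit-learn on a bundled toy dataset; the dataset and model are placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 10-fold cross-validation: ten scores give a spread, not just a point value
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(f"accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")
```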
General discussion:
- Sergei: The tracking challenge dataset: where is that planned to be stored?
  - David: It has to be on Kaggle; when the challenge is done, we can do what we want with it
    - Intend to upload it to the CERN Open Data portal
    - Right now it's 1 TB, so we have to see
- Sunje: For me it is not entirely clear what you are aiming for
  - Very impressed with the OpenML presentation
  - For the Kaggle challenge, it's obvious I would consider what we are doing: putting it on CERN Open Data or similar
  - If the main interaction happens on OpenML afterwards, then Zenodo would be a good place to start
  - Not clear which way you want to go
  - One of the use cases: you want a DOI that is versioned
    - Either of the places can help you get the DOI; Inspire or others can track the DOI
  - If you want more of the interactive stuff, Kaggle or OpenML seem to cover that
- Sergei: Lots of good things can be done with open data
  - The idea here is to have something not specific to any experiment, but rather generic datasets to be used as benchmarks
    - Want to still be relevant to particle physics experiments
  - Idea is to have these benchmark datasets to see how techniques work
    - Different variations: open it up to the outside and have a Kaggle competition
    - The other is within the community: track progress on figures of merit relevant to us
    - Some areas where we want to encourage more activity
  - The whole point for us is to put together these datasets, maintain them, and promote them
- Michele: I think we can go in an incremental way
  - Primary goal is to provide repositories which are easy to find, unique, and provide a DOI
    - Need to think and take a decision; Zenodo and CERN Open Data both seem good choices
  - Second level: a set of interactive tools associated with the dataset
    - Can be used as a starting point for people in the future, such as OpenML
- Lars: Same comment I would have made
  - Compare the features of Zenodo and CERN Open Data
  - Use that archive as a back-end
  - Then work with something like OpenML for community interactions
- Lars: One last comment on something we haven't discussed so much: the outreach part
  - Zenodo is more of an everything-self-serve service
  - CERN Open Data really should gather everything open from CERN
- Sunje: Might want to consider community-specific features
  - For example, is XRootD relevant for the community?
- Joaquin: The combination of OpenML and Zenodo is a very good one
  - We have APIs to link datasets to ML or data science tools in general
  - Would be very interested in exploring how we can integrate them closely
  - We could even tell people to put their datasets on Zenodo
- Michele: If anyone has additional thoughts on what service seems best or otherwise, send the IML coordinators an email
  - iml.coordinators@cern.ch
  - We intend to work towards a community repository very soon