CMS Big Data Science Project

Europe/Berlin
Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
Description

FNAL room: DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East

Instructions to create a lightweight CERN account to join the meeting via Vidyo:

If not possible, people can join the meeting by phone; call-in numbers are here:

The meeting id is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

## 170308 - 25th Big Data Meeting

* attendance: Alexey, Igor, Matteo, Kacper, Luca, Victor, Ian, Illia, Luca, Vaggelis, Saba

* news
    * Vaggelis: new CERN fellow ➜ Welcome!
        * first task: Reading ROOT files in Spark from EOS
        * Luca will setup meeting on Tuesday, March 14, afternoon, with Vaggelis, OLI, Kacper and himself
        * OLI will setup meeting with Vaggelis, Matteo, Jim, Kacper, Luca next week (maybe ~Thursday) to get to know each other
        * Evangelos has to be added to Google Groups, Slack, etc.
    * Victor
        * results from Intel cluster
            * polishing results, then maybe report here
            * useful, good investment in characterizing performance on spark
                * working on making it more universal
                * to look into further: I/O performance characterization
        * IBM got started
            * small introduction
            * difficult in the beginning
                * they are using their own storage system
                * couldn't download dependencies from Maven
                * resolved now
                * bottom line: able to read ROOT files
            * Jim: Have we seen a DQM ROOT file? ➜ correction: real data files will be used; bottom line, it's going to be a TTree (MINIAOD, AOD or RECO data tier)
                * no new technology needs to be developed for now

* Striped Event Project
    * presentation: http://tinyurl.com/jxl3t52
    * project goal
        * reduce time to physics, especially speeding up iterations
        * not tackling general computation like spark
        * tackling aggregation, selection, ...
        * tackling interactivity and/or high turnaround
    * hardware:
        * CouchBase: old farm hardware, dataset uses 1.9TB
            * big portion can be in memory and immediately available
        * nginx web cache on SSD
        * client: old development machine with 16 cores
            * workers should not be transient, but permanent on their own hardware
        * overall 2 layers of very fast data cache ➜ take advantage of stripes being easily cacheable
    * events per group
        * using 1,000 for most datasets
        * 10,000 seems to be better, from small investigation
    * performance
        * cached 1M events: 1.5s
        * not cached: ~factor 10 slower
        * features:
            * one sample was efficiently stored with 10,000 events per group
            * other sample with 1,000 events per group
            * seeing a difference in performance
        * right now, we're in the MHz range
        * we can get to the GHz range by introducing a memory cache in the workers and making them persistent
    * histograms
        * dynamically built
        * you can see the data being filled, you can stop and implement changes to correct problems
    * deployment
        * modular
        * many deployment variations possible within the same data center and across centers
    * next steps
        * Jim is working on making the worker persistent and including memory cache
            * currently everything is in Python; Jim is working on something more performant and suitable
        * looking at different backends and also the user laptop analysis use case
            * replace the lowest two layers with local disk and keep the API the same (running remotely/centralized and on one's own local computer should look the same)
        * virtual datasets
            * recalculate parts of a dataset
            * hide combining two datasets into one virtual dataset on the server ➜ allow to override parts of the data
        * skimming
            * incorporate skimming tools into the design
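To make the striped layout above concrete, here is a minimal pure-Python sketch (all names hypothetical; in the real system the stripes live in CouchBase behind an nginx/SSD web cache, not in Python dicts). It splits one column of a 1M-event dataset into groups of 10,000 events and shows how a worker-side memory cache turns a second pass over the data into pure cache hits — the mechanism behind making the workers persistent:

```python
def make_stripes(column, group_size):
    """Split one column of per-event values into fixed-size stripes (event groups)."""
    return [column[i:i + group_size] for i in range(0, len(column), group_size)]

class StripeCache:
    """Toy two-tier lookup: a worker-local memory cache backed by a slower store."""
    def __init__(self, store):
        self.store = store      # stands in for CouchBase / the web cache layer
        self.memory = {}        # worker-resident memory cache
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.memory:
            self.hits += 1
        else:
            self.misses += 1
            self.memory[key] = self.store[key]  # fetch once, keep resident
        return self.memory[key]

# 1M events, one column ("pt"), striped into groups of 10,000 events
events = list(range(1_000_000))
stripes = make_stripes(events, 10_000)
store = {("pt", i): s for i, s in enumerate(stripes)}
cache = StripeCache(store)

# first pass populates the memory cache; the second pass only hits it
for _ in range(2):
    total = sum(len(cache.get(("pt", i))) for i in range(len(stripes)))

print(len(stripes), cache.misses, cache.hits)  # → 100 100 100
```

For scale: the cached figure quoted above (1M events in 1.5 s) corresponds to roughly 0.67 MHz, consistent with "the MHz range".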
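The virtual-dataset idea — overlaying a partial recalculation on a base dataset so clients see a single logical dataset, with the newer stripes overriding the old — can be sketched with `collections.ChainMap` (stripe names and indices hypothetical):

```python
from collections import ChainMap

# base dataset: stripe index -> stripe data; a recalculation overrides stripes 2 and 5
base  = {i: f"base-stripe-{i}" for i in range(8)}
patch = {2: "recalc-stripe-2", 5: "recalc-stripe-5"}

# ChainMap consults the patch first and falls back to the base dataset,
# so clients see one "virtual" dataset without the base being copied or modified
virtual = ChainMap(patch, base)

print(virtual[2])  # → recalc-stripe-2 (overridden)
print(virtual[3])  # → base-stripe-3 (untouched)
```

On the server this overlay would be resolved per stripe key, so only the recalculated parts need to be stored.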

* Saba
    * collaborative work with NERSC
    * improve the HDF5-to-Spark read process
    * Saba gave a presentation to NERSC on details
        * received a couple of suggestions, following up
    * moved from Edison to Cori Phase 1 (Haswell architecture), significant speed up (5x)
    * first fine-tune on Cori phase 1 before thinking about Cori phase 2 (KNL)
    * submitted a paper to a workshop in January; it was accepted, and the camera-ready version is due March 22 ➜ will be presented in May

* Next meeting: March 22
