CMS Big Data Science Project

Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
Description

DIR/ Black Hole-WH2NW - Wilson Hall 2nd fl North West (TBC)

CERN room:

 

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If that is not possible, people can join the meeting by phone; call-in numbers are here:

The meeting ID is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

### Roundtable

* Saba
    * started talking to NERSC about the skimming code and the HDF5 part
    * will present today at the NERSC meeting on HDF5 and queries
    * hoping for feedback and suggestions on running this workflow at NERSC
    * results from running the code at NERSC are available; not sure if we can improve on them
* JimP
    * working with Jin and Igor
    * query system based on Couchbase and JimP's way of accessing data column-wise
        * heavy caching
    * stopped development of the query language a month ago; now concentrating on the implementation
    * Couchbase implementation progressing: all data loaded into Couchbase, 1E6 queries per second
    * JimK concentrating on cache hits and how high that can go
        * KNL will overcome the CPU-to-RAM limitation
        * 7 GHz reached if the data is in cache
        * MCDRAM of the KNL
        * without that, limited to 1 GHz
    * in parallel, developing a direct ROOT reader
        * can start with data directly from files in storage (EOS)
* Luca/Kacper
    * Intel has given access to a test cluster for February
    * the idea is also to test CMS big data there
    * copied Victor's data to the Intel cluster
    * no progress in accessing ROOT files directly from EOS; a fellow starting in March will work on this
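
The column-wise access with heavy caching mentioned in JimP's report can be illustrated with a minimal sketch. This is not the actual query system: the column names, the chunk size, and the in-memory `COLUMNS` dictionary standing in for the backing store are all assumptions made for illustration. It only shows how a query reads a single column chunk by chunk, and how repeated queries turn into cache hits.

```python
from functools import lru_cache

# Hypothetical backing store: one list per column (attribute). This stands in
# for a remote key-value database or files in storage.
COLUMNS = {
    "muon_pt": [12.5, 48.1, 33.0, 7.2, 60.4],
    "muon_eta": [0.3, -1.2, 2.1, 0.0, -0.7],
    "jet_pt": [110.0, 35.5, 72.3, 20.1, 95.0],
}

CHUNK_SIZE = 2  # values per chunk; chosen small only to keep the example readable


@lru_cache(maxsize=64)
def load_chunk(column, chunk_index):
    """Fetch one chunk of one column; results are kept in an LRU cache."""
    start = chunk_index * CHUNK_SIZE
    return tuple(COLUMNS[column][start:start + CHUNK_SIZE])


def read_column(column):
    """Read a whole column chunk by chunk, touching no other column."""
    n_chunks = -(-len(COLUMNS[column]) // CHUNK_SIZE)  # ceiling division
    values = []
    for i in range(n_chunks):
        values.extend(load_chunk(column, i))
    return values


def count_above(column, threshold):
    """Toy query: how many entries in `column` exceed `threshold`?"""
    return sum(1 for v in read_column(column) if v > threshold)


def cache_hit_rate():
    """Fraction of chunk requests served from the cache so far."""
    info = load_chunk.cache_info()
    total = info.hits + info.misses
    return info.hits / total if total else 0.0
```

Running `count_above("muon_pt", 30.0)` followed by `count_above("muon_pt", 10.0)` fetches each chunk of `muon_pt` once and serves the second query entirely from the cache, while `muon_eta` and `jet_pt` are never read.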

### Spark and ROOT files presentation by Victor

* Spark builds the schema before reading the data. This imposes the constraint that all data types must be known before reading.
* Need to plan tests on Intel Lab Cluster ➜ email thread
* We want to concentrate on Python and Jupyter ➜ Python is important
* let's get Histogrammar into the stack
* Python + Jupyter + Histogrammar + PyROOT
* run Jupyter from lxplus and analytix; both have access to ROOT through CVMFS, installed by the SWAN project
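
The schema-first constraint noted in the first bullet of this section can be sketched in plain Python. This is not Spark's API: the `SCHEMA` dictionary, its field names, and the `read_with_schema` helper are hypothetical, and the example only illustrates what it means for every data type to be declared before any row is read.

```python
# Hypothetical schema: column name -> Python type, declared before reading.
SCHEMA = {"run": int, "event": int, "met": float}


def read_with_schema(rows, schema):
    """Yield rows as dicts, rejecting any row that does not match the schema."""
    for row_no, row in enumerate(rows, start=1):
        # The set of fields must be exactly the one declared up front.
        if set(row) != set(schema):
            raise ValueError(
                f"row {row_no}: fields {sorted(row)} do not match the schema"
            )
        # Every value must have the type declared for its field.
        for name, expected in schema.items():
            if not isinstance(row[name], expected):
                raise TypeError(
                    f"row {row_no}: field {name!r} is "
                    f"{type(row[name]).__name__}, expected {expected.__name__}"
                )
        yield dict(row)
```

A row such as `{"run": 1, "event": 10, "met": "n/a"}` is rejected with a `TypeError`, because the type of `met` was fixed as `float` before reading started.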

### Action items

* Thrust 1: prepare instructions for spark-root + Python/Jupyter + Histogrammar
    * run python script
    * use Jupyter notebook
* Thrust 2: data reduction facility: prepare instructions for spark-root + Scala
    * use Scala script
    * decide on output format
* Intel test cluster
    * Victor has a plan for what to test
    * Need to get Matteo in the loop
    * email thread?
