CMS Big Data Science Project

Name: CMS Big Data Science Project
Start: 2017-02-15T16:00:00+01:00
End: 2017-02-15T17:00:00+01:00
Location: No location set

Wednesday 15 Feb 2017, 16:00 → 17:00 Europe/Berlin

Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))

Description

DIR/ Black Hole-WH2NW - Wilson Hall 2nd fl North West (TBC)

CERN room:

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

https://account.cern.ch/account/Externals/

If that is not possible, you can join the meeting by phone; the call-in numbers are here:

http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone

The meeting ID is hidden below the Videoconference Rooms link, but here it is again:

10502145


### Roundtable

* Saba
  * started talking to NERSC about the skimming code and the HDF5 part
  * will present today in the NERSC meeting about HDF5 and queries
  * hoping for feedback and suggestions on running this workflow at NERSC
  * results from running the code at NERSC are available; not sure if we can improve them
* JimP
  * working with Jin and Igor
  * query system based on Couchbase and JimP's way of accessing data column-wise
  * heavy caching
  * stopped development of the query language a month ago to concentrate on the implementation
  * Couchbase implementation progressing: all data loaded into Couchbase, 1E6 queries per second
  * JimK concentrating on cache hits and how high the hit rate can go
  * KNL will overcome the CPU-to-RAM limitation
  * 7 GHz reached if the data is in cache
  * MCDRAM of the KNL
  * without that, limited to 1 GHz
  * in parallel, developing a direct ROOT reader
  * can start with data directly from files in storage (EOS)
* Luca/Kacper
  * Intel has given access to a test cluster for February
  * the idea is also to test CMS big data
  * copied Viktor's data to the Intel cluster
  * no progress in accessing ROOT files directly from EOS; a fellow starting in March will work on this
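
To make the column-wise access and caching idea from JimP's item concrete, here is a minimal Python sketch. It is not the project's query system: the `ColumnCache` class, the `count_passing` helper, the column names, and the in-memory `backend` dictionary (standing in for the database) are all hypothetical. It only shows a query touching a single column and how cache hits can be counted.

```python
from collections import OrderedDict


class ColumnCache:
    """LRU cache of whole columns, with hit/miss counters (illustrative only)."""

    def __init__(self, backend, capacity):
        self.backend = backend      # hypothetical slow store: column name -> list of values
        self.capacity = capacity    # maximum number of columns kept in memory
        self.columns = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, name):
        if name in self.columns:
            self.hits += 1
            self.columns.move_to_end(name)    # mark as most recently used
            return self.columns[name]
        self.misses += 1
        values = self.backend[name]           # stands in for a fetch from the database
        self.columns[name] = values
        if len(self.columns) > self.capacity:
            self.columns.popitem(last=False)  # evict the least recently used column
        return values

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def count_passing(cache, column, threshold):
    """Toy query: count the entries of one column above a threshold."""
    return sum(1 for value in cache.get(column) if value > threshold)


# Toy data, invented for this sketch.
backend = {
    "muon_pt": [12.0, 35.5, 7.2, 48.1],
    "muon_eta": [0.3, -1.2, 2.1, 0.0],
    "jet_pt": [55.0, 23.0, 101.5, 30.2],
}
cache = ColumnCache(backend, capacity=2)
n_first = count_passing(cache, "muon_pt", 20.0)   # miss: column fetched from the backend
n_second = count_passing(cache, "muon_pt", 40.0)  # hit: column served from the cache
```

Only the column named in the query is loaded, so repeated queries on the same column are served from memory and raise the hit rate.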

### Spark and ROOT files presentation by Viktor

* Spark builds the schema before reading the data, which imposes the constraint that all data types must be known before reading.
* Need to plan tests on the Intel Lab Cluster ➜ email thread
* We want to concentrate on Python and Jupyter ➜ Python is important
* let's get Histogrammar into the stack
* Python + Jupyter + Histogrammar + PyROOT
* run Jupyter from lxplus and analytix; both have access to ROOT through CVMFS, installed by the SWAN project
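
The schema-first constraint from the first bullet can be illustrated with a small standard-library Python sketch. It uses neither Spark nor spark-root: the `SCHEMA` list, the column names, and the helper functions are assumptions made up for this example. The point is only that the record layout is fixed from the declared types before any bytes are decoded.

```python
import struct

# Hypothetical schema: (column name, struct format code). Every type is fixed up front.
SCHEMA = [("run", "i"), ("event", "q"), ("pt", "f")]


def build_record_format(schema):
    """Build one little-endian record layout from the schema, before any data is read."""
    return struct.Struct("<" + "".join(code for _, code in schema))


def read_records(buffer, schema):
    """Decode a byte buffer into dict rows using only the pre-built schema."""
    record = build_record_format(schema)
    names = [name for name, _ in schema]
    if len(buffer) % record.size != 0:
        raise ValueError("buffer length does not match the declared schema")
    return [dict(zip(names, values)) for values in record.iter_unpack(buffer)]


# Two toy records written with the same schema.
writer = build_record_format(SCHEMA)
payload = writer.pack(1, 1001, 25.5) + writer.pack(1, 1002, 40.0)
rows = read_records(payload, SCHEMA)
```

A buffer that does not fit the declared layout is rejected rather than guessed at. In the same way, a schema-first reader cannot handle data whose types are not known in advance.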

### Action items

* Thrust 1: prepare instructions for spark-root + Python/Jupyter + Histogrammar
  * run Python script
  * use Jupyter notebook
* Thrust 2: data reduction facility: prepare instructions for spark-root + Scala
  * use Scala script
  * decide on output format
* Intel test cluster
  * Viktor has a plan for what to test
  * need to get Matteo in the loop
  * email thread?


- 16:00 → 16:05
  News 5m
  
  Speakers: Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
  
  Planning Google Docs
  Openlab framework and project agreements have been sent to the DOE for signature (we expect questions ... )
  
  Plan for today's meeting
  
  Quick roundtable
  
  Viktor will introduce how to access ROOT files from Spark and give a demo
  
  Planning discussion
- 16:05 → 16:30
  
  ROOT files and Spark 25m
  
  Speaker: Viktor Khristenko (The University of Iowa (US))
  
  rootio_08022017.pdf