CMS Big Data Science Project
PPD/ Round Table-WH11SE - Wilson Hall 11th fl South East
CERN room:
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If that is not possible, you can join the meeting by phone; call-in numbers are here:
The meeting ID is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
## 170208 - Big Data Meeting
* Status reports
* Saba
* January: submitted a paper to a workshop: HP Big Data Computing (Spark on NERSC)
* Submitted a proposal for the same use case on CORI II at NERSC. It was rejected, but the NERSC team contacted Saba about using the use case to tune Spark on CORI II
* Need to submit queries and HDF5 layout to NERSC next week
* Plans to implement more use cases
* long term: other experiments
* for now: more sophisticated analysis queries
* Saba wants to submit to Spark Summit East; the deadline will be in March, April, or May
* Need more data ➜ Matteo will send data set lists for MINIAOD and help to get started copying more files
* Discussion about plans (following google doc linked to agenda)
* comments from Matteo
* reading ROOT from Spark directly very promising
* potential to have plots directly out of any ROOT ntuple
* schema is generated for input dynamically
* the plotting package developed by Jim last year can be called directly, so there is no need to write out ntuple-like files
* some things are missing
* Victor is working on some libraries that are currently missing
* On top of MINIAOD we run CMSSW code to recluster jets, etc.
* running CMSSW from Python looks promising
* comments from Luca
* Victor is giving a presentation at the ROOT I/O workshop
* tested on 1 TB from HDFS
* in the next weeks, there will be some tuning of the code to directly access ROOT files from Spark
* proposal: ask Victor to repeat the ROOT I/O talk at next week's meeting
* Reading directly from EOS into Spark is work in progress
* 1st step is copying to HDFS; this is working
* We need to work on the 2nd step, reading from EOS directly