CMS Big Data Science Project
FNAL: DIR / Black Hole - WH2NW (Wilson Hall 2nd floor North West)
CERN: 31-S-028
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If not possible, people can join the meeting by phone; the call-in numbers are here:
The meeting id is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
* attendance: Luca, Kazper, Vagg, Alexey, JimP, JimK, OLI
* ROOT from EOS in Hadoop
* a meeting with the engineers from other CERN-IT departments finished just before this meeting
* getting close to a first working implementation that can read ROOT files from EOS in Spark (a sketch of what this could look like follows below)
* estimate: 4 weeks
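For reference, a minimal sketch of what reading a ROOT file from EOS in Spark could look like. Everything here is an assumption rather than the team's actual code: the spark-root package coordinates and DataSource name, the XRootD/EOS connector resolving root:// URIs, and the EOS path.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming spark-root is on the classpath (e.g. submitted with
// --packages org.diana-hep:spark-root_2.11:<version>) and that an XRootD/EOS
// Hadoop connector makes root:// URIs resolvable. The EOS path is hypothetical.
object ReadRootFromEos {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("root-from-eos")
      .getOrCreate()

    // spark-root exposes ROOT TTrees as DataFrames through a Spark DataSource.
    val events = spark.read
      .format("org.dianahep.sparkroot")
      .load("root://eospublic.cern.ch//eos/path/to/dataset.root")

    events.printSchema()
    println(s"number of events: ${events.count()}")

    spark.stop()
  }
}
```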
* Instructions to run the reduction example
* instructions from Vagg, code from Viktor
* run from our VM
* Performance metrics discussion
* document on webpage https://cms-big-data.github.io
* add or change pages through this github repo: https://github.com/cms-big-data/cms-big-data.github.io-source (pages are in content/pages)
* the site uses Travis CI to build the pages (it takes a few minutes after changes have been pushed)
* Application metrics
* primary metric for the reduction facility:
* How quickly can I reduce how many events? (a back-of-envelope sketch follows this list)
* depends on
* reduction factor
* size per event
* how much of the event is accessed during reduction (to make decision (skimming) and also to pass on to output (slimming))
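A back-of-envelope model of the primary metric above: events, times bytes per event, times the fraction of each event actually accessed, divided by the aggregate read bandwidth. All names and numbers are illustrative, not measured values from the project.

```scala
// Back-of-envelope estimate for "how quickly can I reduce how many events".
object ReductionEstimate {
  /** Estimated wall time (seconds) to reduce nEvents, given how much of each
    * event is read (for skimming decisions and the slimmed output) and the
    * aggregate read bandwidth of the system in bytes/s. */
  def reductionTimeSeconds(nEvents: Long,
                           bytesPerEvent: Double,
                           fractionAccessed: Double,
                           bandwidthBytesPerSec: Double): Double =
    nEvents * bytesPerEvent * fractionAccessed / bandwidthBytesPerSec

  def main(args: Array[String]): Unit = {
    // Illustrative numbers: 1e9 events, 100 kB/event, 10% of each event
    // touched, 10 GB/s aggregate read bandwidth.
    val t = reductionTimeSeconds(1000000000L, 100e3, 0.1, 10e9)
    println(f"estimated reduction time: $t%.0f s (~${t / 3600}%.1f h)")
  }
}
```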
* System metrics: always aiming for a root cause analysis
* memory usage and caching strategy
* I/O metrics
* Spark built-in metrics (a listener sketch follows this list)
* CPU time of all executors
* time spent in garbage collection, time spent in serialization
* from HDFS you get the number of rows and the amount of data read
* measure network traffic, important for reading from EOS
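A sketch of collecting the Spark built-in metrics mentioned above (executor CPU time, GC time, serialization time, HDFS bytes and records read) with a SparkListener. The listener API and the TaskMetrics fields are standard Spark 2.x; aggregating them into global counters is just one possible approach.

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulates task-level metrics across the whole application.
class IoMetricsListener extends SparkListener {
  val cpuTimeNs   = new AtomicLong(0L) // executor CPU time, nanoseconds
  val gcTimeMs    = new AtomicLong(0L) // JVM garbage-collection time, ms
  val serTimeMs   = new AtomicLong(0L) // result serialization time, ms
  val bytesRead   = new AtomicLong(0L) // bytes read from the input source (e.g. HDFS)
  val recordsRead = new AtomicLong(0L) // records (rows) read from the input source

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) { // metrics can be missing for failed tasks
      cpuTimeNs.addAndGet(m.executorCpuTime)
      gcTimeMs.addAndGet(m.jvmGCTime)
      serTimeMs.addAndGet(m.resultSerializationTime)
      bytesRead.addAndGet(m.inputMetrics.bytesRead)
      recordsRead.addAndGet(m.inputMetrics.recordsRead)
    }
  }
}

// Usage: spark.sparkContext.addSparkListener(new IoMetricsListener)
```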
* Comment from Luca: reading Parquet from Spark is normally CPU-bound
* JimP: compression? can you try without compression? (a minimal way to test this is sketched below)
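One minimal way to test JimP's suggestion: write an uncompressed Parquet copy of the data and compare read-side CPU time against the compressed copy. The paths below are hypothetical; the compression option and codec setting are standard Spark SQL.

```scala
import org.apache.spark.sql.SparkSession

object UncompressedParquetCopy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compression-test").getOrCreate()

    val df = spark.read.parquet("hdfs:///data/events.parquet") // hypothetical input

    // Per-write codec override: "uncompressed" disables compression for this copy.
    df.write
      .option("compression", "uncompressed")
      .parquet("hdfs:///data/events_uncompressed.parquet")

    // Alternatively, change the session-wide default codec:
    // spark.conf.set("spark.sql.parquet.compression.codec", "uncompressed")

    spark.stop()
  }
}
```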
* discussion about comparing ROOT and Spark
* very difficult, as we saw from the CHEP paper exercise
* JimP has numbers comparing C++ ROOT reading and spark-root reading
* if the C++ ROOT reading is done correctly, ROOT is 4 times faster than spark-root (the C++ code and the Java code do exactly the same thing)
* we could invite the ROOT team to help us optimize the ROOT workflow
* todo list
* add metrics to the spark-root reader, like the metrics you get from Parquet
* round table items
* Saba's and JimK's HDF5 investigations were presented at
* the root4j and spark-root repositories are not cleanly separated; JimP has a student who will work on a branch to refactor the ROOT-I/O-specific components of spark-root and move them into root4j
* JimK and Saba will get a summer student to do the alternative implementation (numpy+pandas+mpi); it will also use the tools coming out of the LDRD to convert into HDF5
* next meeting: June 21
* Plan to ask Marc Paterno (FNAL) to present his LDRD (Lab Directed R&D) project on optimally converting ROOT files into HDF5 format; JimK will talk to Marc
* Luca will ask whether Jakob Blomer and other SFT people would like to join
* JimP and OLI will be at CERN