CMS Big Data Science Project
FNAL room: DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East
CERN room:
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If that is not possible, people can join the meeting by phone; call-in numbers are here:
The meeting id is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
* attendance: Illia, Cristina, Saba, JimK, Igor, Luca, Zbigniew, Alexey, Bo, OLI
* news
* Today it is all about looking at the full analysis pass and the documentation
* pass through code at Princeton
* https://slack-files.com/T0LRAT4HF-F21RA250Q-5460548b47
* need to complete the documentation ➜ making plots in the end is not yet complete
* need to decide how to do fits at the end
* discussion of how easy it is to reproduce this at CERN
* on-the-fly conversion vs. converting once and storing Avro in HDFS at CERN
* 3 possibilities
* Intel needs to replicate the workflow in their facilities
* questions:
* Why Avro and not Parquet? ➜ First question to ask after CHEP paper
* The code makes heavy use of RDDs; why not use DataFrames from the start? ➜ Good question; it was easier to start with RDDs.
* NERSC
* use the HDF5 files directly and not go through RDDs
* all ROOT files are at NERSC
* working with the HDF5Spark team to find out why the Scala API is not returning the correct data fields
* current bottleneck: reading data frames from HDF5 in Scala
* Python converter from ROOT to HDF5 works fine but is slow
* could do in C++
* but the conversion has only been done once so far; can optimize later
* don’t want to start the conversion of everything till the full analysis pass on one file is complete and successful
* from analysis perspective, all looks encouraging
* Grace Hopper conference
* presentation is on October 21
* Saba will prepare a short timeline of which ingredients need to be ready when for the presentation
* especially the physics part needs help from all of us
* target is to have a full working workflow at NERSC by September 21
* goal:
* analysis code
* make documentation available on TWiki or similar public space (Slack does not seem to work, links do not work)
* complete documentation, especially in the plotting part at the end
* everybody needs to start playing with the code!
* Princeton workflow
* replicate the workflow at any facility
* NERSC:
* complete HDF5 reading from Scala
* Discussion in 2 weeks: how do we compare Spark with ROOT? Metrics?
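
As possible input to that metrics discussion, here is a minimal sketch of a timing harness that records wall-clock time and event throughput for one analysis pass. Nothing here was agreed at the meeting: the choice of metrics is only an assumption, and `RunMetrics`, `measure`, and `dummy_pass` are hypothetical names; `dummy_pass` is a placeholder for a real Spark or ROOT job.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunMetrics:
    """Timing result for one analysis pass (hypothetical helper type)."""
    label: str
    n_events: int
    wall_seconds: float

    @property
    def events_per_second(self) -> float:
        # Guard against a zero-length measurement on very fast runs.
        if self.wall_seconds <= 0:
            return float("inf")
        return self.n_events / self.wall_seconds


def measure(label: str, run: Callable[[], int]) -> RunMetrics:
    """Time one pass; `run` must return the number of events it processed."""
    start = time.perf_counter()
    n_events = run()
    elapsed = time.perf_counter() - start
    return RunMetrics(label, n_events, elapsed)


def dummy_pass() -> int:
    # Placeholder workload, NOT a real analysis: loops over fake event ids.
    n_events = 100_000
    _selected = sum(1 for event_id in range(n_events) if event_id % 7 == 0)
    return n_events


if __name__ == "__main__":
    # Both labels run the same placeholder; swap in the real jobs to compare.
    for label in ("Spark (placeholder)", "ROOT (placeholder)"):
        m = measure(label, dummy_pass)
        print(f"{m.label}: {m.n_events} events in {m.wall_seconds:.3f} s "
              f"({m.events_per_second:.0f} events/s)")
```

Running the same harness around both workflows on the same input file would give directly comparable numbers; whether wall-clock time and throughput are the right metrics is exactly what the discussion needs to settle.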