CMS Big Data Science Project
FNAL room: TaberNAcle (WH5E, Wilson Hall 5th floor East)
CERN room:
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If not possible, people can join the meeting by the phone, call-in numbers are here:
The meeting id is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
## Meeting
attendance: Illia, Zbigniew, Luca, Maria, Alexey, Bo, JimK, Nhan, Matteo, JimP, OLI
* Princeton thrust
* code is not yet visible from outside
* plan to repeat it at CERN
* this afternoon: work on getting it to run at CERN
* read ROOT inputs on the fly
* documentation should be figured out by the end of the week
* NERSC thrust
* question about the documentation; a coherent flow was not visible
* need quick response time on questions to implement Princeton workflow at NERSC
* not sure how much modification of the analysis code will be needed
* conversion from a row-wise to a column-wise workflow
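The row-wise to column-wise conversion mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual code; the event records and field names are invented, and a real workflow would read them from ROOT or AVRO files.

```python
def rows_to_columns(rows):
    """Turn a list of per-event records (row-wise) into one list per
    branch/field (column-wise), the layout a columnar engine prefers."""
    if not rows:
        return {}
    columns = {key: [] for key in rows[0]}
    for event in rows:
        for key, value in event.items():
            columns[key].append(value)
    return columns

# Toy events, purely for illustration:
events = [
    {"run": 1, "nJets": 3, "met": 41.2},
    {"run": 1, "nJets": 2, "met": 17.8},
    {"run": 2, "nJets": 4, "met": 88.0},
]
cols = rows_to_columns(events)
print(cols["nJets"])  # one contiguous column per field
```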
* NoSQL
* JimP gave Igor some Bacon AVRO files to test and work with
* Igor is talking to JimP
* Plans for CHEP
* paper done by beginning of October
* have 1 month
* Grace Hopper
* 3 weeks after CHEP
* needs a paper with pretty good results and performance measurements
* coding needs to be done by the end of September
* SC is November 20th
* Metrics
* working at Princeton
* technical comparison
* dummy ROOT analysis (counting events) compared to Spark (counting events)
* full analysis code (TTreeCache vs. Spark cache)
* usability question
* how much easier is it to do the analysis in Spark vs. ROOT
* physics comparison
* take established workflow
* change basic code, like a cut
* measure the time to change the cut
* measure the time to produce a new physics plot with backgrounds and signal and everything
* comparison: running both at 30 different operating points
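A scan over operating points like the one proposed above could look like the sketch below. The cut variable (MET), the toy data, and the helper name are all invented for illustration; the real comparison would run the full ROOT and Spark workflows at each point.

```python
def scan_operating_points(met_values, cut_points):
    """Count events passing a MET cut at each operating point."""
    return {cut: sum(1 for met in met_values if met > cut)
            for cut in cut_points}

# Toy MET values and a scan of 30 cut values, purely for illustration:
met_values = [12.0, 25.0, 40.0, 55.0, 80.0, 120.0]
cut_points = [10.0 + 5.0 * i for i in range(30)]  # 30 operating points
yields = scan_operating_points(met_values, cut_points)
print(yields[10.0], yields[50.0])
```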
* Metrics
* technical comparison of the different steps of the workflow: time and memory
* JimK: problems with lazy evaluation
* JimP: breaking it down into trivial tasks
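A harness for the time-and-memory side of the technical comparison could be as simple as the sketch below. The `count_events` step is a stand-in for the real ROOT or Spark event count; only the measurement pattern (wall time plus peak Python heap via `tracemalloc`) is the point here.

```python
import time
import tracemalloc

def measure(step, *args):
    """Run one workflow step; return (result, seconds, peak_bytes)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = step(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

def count_events(events):
    # Dummy analysis: just count events, as in the ROOT-vs-Spark baseline.
    return sum(1 for _ in events)

n, seconds, peak = measure(count_events, range(1_000_000))
print(n, seconds > 0.0, peak >= 0)
```

Note that `tracemalloc` only sees Python-level allocations; measuring a ROOT or JVM process would need an external tool instead.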
* usability comparison, user experience
* changing one cut and producing the same physics plot (background and data)
* changing a more significant piece of the analysis and comparing to produce a physics plot
* optimization problem of either a cut or a plot (order of backgrounds)
* need objective and subjective metrics; also include how much time it takes to change the code
* JimP: familiar with ROOT, batch systems, Spark, Scala ➜ but no idea of the workflow
* Nhan: different profile; familiar with analysis, ROOT, batch systems; not familiar with Spark
* Saba: runs Spark at NERSC
* CERN: measure how hard it is to move the workflow to CERN
* Getting the data somewhere to analyze it is the hardest thing
* CERN is thinking about how to improve and simplify data access from, for example, Spark/Hadoop
* can mount EOS
* can even use CERNBox
* working on making the hdfs command understand xrootd ➜ can this be open sourced? JimP is very interested, a fundamental building block
* Next meeting, next week