CMS Big Data Science Project
FCPA / Dark Side - WH6NW (Wilson Hall, 6th floor, North West)
CERN room:
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If that is not possible, you can join the meeting by phone; the call-in numbers are here:
The meeting ID is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
## Big Data meeting 160907
* attendance: Luca, Sbingiev, Igor, Bo, Matteo, Nhan, JimP, Alexey
* News
* poster for SuperComputing was accepted
* IEEE Big Data poster deadline was postponed till November (5 months!)
* We could have submitted a poster
* Alexey may go to the Big Data conference at the end of December
* There is a co-hosted workshop with delayed deadlines
* Alexey will submit a paper on other work using the histogrammer
* the deadline is the beginning of October
* today, let's hear a short update from Igor on NoSQL
* then let's talk about metrics
* JimP, Jin, and Igor discussed the use case
* approach of indexing by calculable parameters
* skimming the datasets by these parameters
* and then analyzing the dataset further
* still at the conceptual stage
* concentrating on indexing
* the data structures don't translate into a flat vector space
* idea: for a dataset of 1B events, go straight to the events you are interested in (see the sketch after this item)
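A minimal Spark/Scala sketch of the index-and-skim idea, assuming the events already sit in a columnar format; the index columns `nJets` and `ht`, and the paths, are invented placeholders, since the actual set of calculable parameters was still being worked out at this stage:

```scala
import org.apache.spark.sql.SparkSession

object SkimByIndex {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SkimByIndex").getOrCreate()

    // Hypothetical 1B-event dataset with cheap "calculable parameters"
    // stored as flat index columns next to the full event payload.
    val events = spark.read.parquet("hdfs:///bigdata/events.parquet")

    // Skim by the index columns; with a columnar format the filter can be
    // pushed down, so the scan is driven by the cheap index columns and
    // goes straight to the events of interest.
    val skim = events.filter("nJets >= 2 AND ht > 300.0")

    // Persist the skim for further analysis.
    skim.write.parquet("hdfs:///bigdata/skim.parquet")
    spark.stop()
  }
}
```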
* Status of comparison
* we need the ROOT code; it is currently not working, and Matteo is fixing it for the test
* it is not far from working
* it needs to run on lxplus
* Link to documentation of spark/scala code
* https://twiki.cern.ch/twiki/bin/view/CMSPublic/PrincetonBigDataWorkflow
* Metrics
* technical comparison
* dummy ROOT analysis (counting events) compared to Spark (counting events); see the sketch after this list
* full analysis code (ROOT with TTreeCache vs. Spark with caching)
* usability question
* how much easier is it to do the analysis in Spark vs. ROOT
* physics comparison
* take established workflow
* change basic code, like a cut
* measure the time to change the cut
* measure the time to produce a new physics plot with backgrounds and signal and everything
* comparison: running both at 30 different operating points
* open question: how to normalize for lxbatch queue depth and number of running jobs vs. the size of the Spark cluster
* idea (Alexey): tune the Spark partitioning parameter
* a first tuning resulted in a 15% performance improvement
* reducing the number of partitions was beneficial
* we can change the partitioning without changes to the analysis code (as in the sketch after this list)
* difference: in ROOT, partitioning is defined by the analyst (which for loop runs over which file list), while in Spark it is done by the system, independently of the analyst
* idea (Igor): a cost metric (cost per million events)
* hardware cost is a good first step
* we need to track these properties while running, to be able to normalize the running time by the amount of hardware
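A sketch of the "dummy analysis" timing test with the partition count exposed as a command-line knob, illustrating that the Spark partitioning can be changed without touching the analysis code; the paths, the default of 200 partitions, and the timing printout are assumptions, not the actual benchmark configuration:

```scala
import org.apache.spark.sql.SparkSession

object CountEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CountEvents").getOrCreate()

    // Partition count comes from the command line: tuning it (e.g. the
    // reduction that gave the ~15% improvement) needs no code changes.
    val numPartitions = args.headOption.map(_.toInt).getOrElse(200)

    val events = spark.read
      .parquet("hdfs:///bigdata/events.parquet")
      .coalesce(numPartitions) // reduce partitions; repartition() can also grow them

    // Dummy analysis: just count events, mirroring a trivial ROOT event loop.
    val t0 = System.nanoTime()
    val n  = events.count()
    val dt = (System.nanoTime() - t0) / 1e9

    println(f"$n events counted in $dt%.1f s with $numPartitions partitions")
    spark.stop()
  }
}
```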
* Overleaf for CHEP paper
* OLI will set it up
* please use versions often
* if you use it in git mode, use versions as well
* set snapshots
* sometimes there is confusion when you use the mixed mode
* the git commits are not meaningful
* CMSSW writing out AVRO files
* a working example exists
* started with a complex structure from the CMSSW event (event properties, jet collections, jets with data members)
* a bit fragile, but it is a proof of principle
* the user would define a JSON file that describes the data structures (an invented example appears in the sketch after this item); using YAML is being considered
* next:
* clean up the repository and provide a README
* provide a CMSSW wrapper (motivated by the fragility)
* a stepping stone to a less fragile setup
* right now it is an EDAnalyzer; it could be made into an OutputModule
* talk to Chris Jones
* the AVRO C library needs to be linked into CMSSW
* the same approach could in principle be used for HDF5
* will be the starting point for the discussion of restructuring the workflow after CHEP
* a separate paper with all the information will be written, because the worry is that the CHEP paper already contains too much information
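A hedged sketch of how the AVRO files written by the EDAnalyzer could be read back in Spark via the spark-avro package; the schema in the comment is an invented example of the kind of nested structure described above (the real one comes from the user's JSON definition), and the path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

object ReadCmsswAvro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ReadCmsswAvro").getOrCreate()

    // Invented example of the nested structure (event properties plus a
    // jet collection with data members) as an Avro schema:
    //   {"type": "record", "name": "Event", "fields": [
    //     {"name": "run",  "type": "int"},
    //     {"name": "lumi", "type": "int"},
    //     {"name": "Jets", "type": {"type": "array", "items":
    //       {"type": "record", "name": "Jet", "fields": [
    //         {"name": "pt", "type": "float"}, {"name": "eta", "type": "float"}]}}}]}

    // spark-avro (com.databricks:spark-avro) maps Avro records to rows,
    // keeping the nested Jets array intact.
    val events = spark.read
      .format("com.databricks.spark.avro")
      .load("hdfs:///bigdata/cmssw/*.avro")

    events.printSchema()
    events.selectExpr("run", "size(Jets) AS nJets").show(5)
    spark.stop()
  }
}
```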
* update on CERN workflow
* nothing yet; the C++ code work needs to happen first
* next meeting, next week, September 14