CMS Big Data Science Project
FCPA / Dark Side - WH6NW (Wilson Hall, 6th floor, North West)
CERN room:
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If that is not possible, you can join the meeting by phone; the call-in numbers are here:
The meeting ID is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
## Big Data meeting 160907
* attendance: Luca, Sbingiev, Igor, Bo, Matteo, Nhan, JimP, Alexey
* News
* poster for SuperComputing was accepted
* IEEE Big Data poster deadline was postponed till November (5 months!)
* We could have submitted a poster
* Alexey may go to the Big Data conference at the end of December
* There is a co-hosted workshop with delayed deadlines
* Alexey will submit a paper on other work using the histogrammer
* the deadline is the beginning of October
* today, let's hear a short update from Igor on NoSQL
* then let's talk about metrics
* JimP, Jin, and Igor discussed the use case
* approach of indexing by calculable parameters
* skimming the datasets by these parameters
* and then analyzing the dataset further
* still at the conceptual stage
* concentrating on indexing
* the data structures don't translate into a flat vector space
* idea: for a dataset of 1B events, go straight to the events you are interested in (see the sketch after this item)
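A minimal Spark/Scala sketch of the index-and-skim idea, assuming the events already sit in a columnar format; the index columns `nJets` and `ht`, and the paths, are invented placeholders, since the actual set of calculable parameters was still being worked out at this stage:

```scala
import org.apache.spark.sql.SparkSession

object SkimByIndex {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SkimByIndex").getOrCreate()

    // Hypothetical 1B-event dataset with cheap "calculable parameters"
    // stored as flat index columns next to the full event payload.
    val events = spark.read.parquet("hdfs:///bigdata/events.parquet")

    // Skim by the index columns; with a columnar format the filter can be
    // pushed down, so the scan is driven by the cheap index columns and
    // goes straight to the events of interest.
    val skim = events.filter("nJets >= 2 AND ht > 300.0")

    // Persist the skim for further analysis.
    skim.write.parquet("hdfs:///bigdata/skim.parquet")
    spark.stop()
  }
}
```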
* Status of comparison
* we need the ROOT code; it is currently not working, and Matteo is fixing it for the test
* it is not far from working
* it needs to run on lxplus
* Link to documentation of spark/scala code
* https://twiki.cern.ch/twiki/bin/view/CMSPublic/PrincetonBigDataWorkflow
* Metrics
* technical comparison
* dummy ROOT analysis (counting events) compared to Spark (counting events); see the sketch after this list
* full analysis code (ROOT with TTreeCache vs. Spark with caching)
* usability question
* how much easier is it to do the analysis in Spark vs. ROOT
* physics comparison
* take established workflow
* change basic code, like a cut
* measure the time to change the cut
* measure the time to produce a new physics plot with backgrounds and signal and everything
* comparison: running both at 30 different operating points
* open question: how to normalize for lxbatch queue depth and number of running jobs vs. the size of the Spark cluster
* idea (Alexey): tune the Spark partitioning parameter
* a first tuning resulted in a 15% performance improvement
* reducing the number of partitions was beneficial
* we can change the partitioning without changes to the analysis code (as in the sketch after this list)
* difference: in ROOT, partitioning is defined by the analyst (which for loop runs over which file list), while in Spark it is done by the system, independently of the analyst
* idea (Igor): a cost metric (cost per million events)
* hardware cost is a good first step
* we need to track these properties while running, to be able to normalize the running time by the amount of hardware
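A sketch of the "dummy analysis" timing test with the partition count exposed as a command-line knob, illustrating that the Spark partitioning can be changed without touching the analysis code; the paths, the default of 200 partitions, and the timing printout are assumptions, not the actual benchmark configuration:

```scala
import org.apache.spark.sql.SparkSession

object CountEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CountEvents").getOrCreate()

    // Partition count comes from the command line: tuning it (e.g. the
    // reduction that gave the ~15% improvement) needs no code changes.
    val numPartitions = args.headOption.map(_.toInt).getOrElse(200)

    val events = spark.read
      .parquet("hdfs:///bigdata/events.parquet")
      .coalesce(numPartitions) // reduce partitions; repartition() can also grow them

    // Dummy analysis: just count events, mirroring a trivial ROOT event loop.
    val t0 = System.nanoTime()
    val n  = events.count()
    val dt = (System.nanoTime() - t0) / 1e9

    println(f"$n events counted in $dt%.1f s with $numPartitions partitions")
    spark.stop()
  }
}
```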
* Overleaf for CHEP paper
* OLI will set it up
* please use versions often
* if you use it in git mode, use versions as well
* set snapshots
* sometimes there is confusion when you use the mixed mode
* the git commits are not meaningful
* CMSSW writing out AVRO files
* a working example exists
* started with a complex structure from the CMSSW event (event properties, jet collections, jets with data members)
* a bit fragile, but it is a proof of principle
* the user would define a JSON file that describes the data structures (an invented example appears in the sketch after this item); using YAML is being considered
* next:
* clean up the repository and provide a README
* provide a CMSSW wrapper (motivated by the fragility)
* a stepping stone to a less fragile setup
* right now it is an EDAnalyzer; it could be made into an OutputModule
* talk to Chris Jones
* the AVRO C library needs to be linked into CMSSW
* the same approach could in principle be used for HDF5
* will be the starting point for the discussion of restructuring the workflow after CHEP
* a separate paper with all the information will be written, because the worry is that the CHEP paper already contains too much information
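A hedged sketch of how the AVRO files written by the EDAnalyzer could be read back in Spark via the spark-avro package; the schema in the comment is an invented example of the kind of nested structure described above (the real one comes from the user's JSON definition), and the path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

object ReadCmsswAvro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ReadCmsswAvro").getOrCreate()

    // Invented example of the nested structure (event properties plus a
    // jet collection with data members) as an Avro schema:
    //   {"type": "record", "name": "Event", "fields": [
    //     {"name": "run",  "type": "int"},
    //     {"name": "lumi", "type": "int"},
    //     {"name": "Jets", "type": {"type": "array", "items":
    //       {"type": "record", "name": "Jet", "fields": [
    //         {"name": "pt", "type": "float"}, {"name": "eta", "type": "float"}]}}}]}

    // spark-avro (com.databricks:spark-avro) maps Avro records to rows,
    // keeping the nested Jets array intact.
    val events = spark.read
      .format("com.databricks.spark.avro")
      .load("hdfs:///bigdata/cmssw/*.avro")

    events.printSchema()
    events.selectExpr("run", "size(Jets) AS nJets").show(5)
    spark.stop()
  }
}
```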
* update on CERN workflow
* nothing yet; the C++ code work needs to happen first
* next meeting, next week, September 14