CMS Big Data Science Project

Conveners: Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))

Location: FCPA / Dark Side - WH6NW - Wilson Hall 6th floor North West

Vidyo meeting ID: 10502145

## Big Data meeting 160907

* attendance: Luca, Sbingiev, Igor, Bo, Matteo, Nhan, JimP, Alexey

* News
    * poster for SuperComputing was accepted
    * IEEE Big Data poster deadline was postponed till November (5 months!)
        * We could have submitted a poster
    * Alexey may go to the Big Data conference at the end of December
        * There is a co-hosted workshop with delayed deadlines
        * Alexey will submit a paper on other work using the histogrammer
        * deadline is beginning of October
    * today, let's hear a short update from Igor on NoSQL
    * then let's talk about metrics

* JimP, Jin, and Igor discussed the use case
    * approach of indexing by calculable parameters
    * skimming the datasets by these parameters
    * and then analyze the dataset further
    * at the conceptual stage
    * concentrating on indexing
    * the data structures don't translate into a flat vector space
    * idea: for a dataset of 1B events, go straight to the events that you are interested in (see the sketch after this list)
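
A minimal Spark/Scala sketch of this index-then-skim pattern; the file path, the eventId/jetPt schema, and the cut values are hypothetical stand-ins for illustration, not the code under discussion:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("IndexAndSkim").getOrCreate()
import spark.implicits._

// hypothetical flat view of the events: an eventId plus an array column jetPt
val events = spark.read.parquet("hdfs:///path/to/events.parquet")

// 1) index: compute cheap, "calculable" parameters once per event
val index = events
  .select($"eventId", explode($"jetPt").as("pt"))
  .groupBy("eventId")
  .agg(sum($"pt").as("ht"), count("*").as("nJets"))

// 2) skim: select the interesting events using the index alone
val interesting = index.filter($"nJets" >= 4 && $"ht" > 500.0)

// 3) analyze the full records of the surviving events only,
//    instead of looping over all ~1B events
val skim = events.join(interesting, "eventId")
println(s"events in skim: ${skim.count()}")
```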

* Status of comparison
    * we need the ROOT code; it is currently not working, and Matteo is fixing it for the test
        * not far from done
        * it needs to run on lxplus
    * Link to documentation of spark/scala code
        * https://twiki.cern.ch/twiki/bin/view/CMSPublic/PrincetonBigDataWorkflow
  
* Metrics
    * technical comparison
        * dummy ROOT analysis (counting events) compared to Spark (counting events)
        * full analysis code (ROOT with TTreeCache vs. Spark with caching)
    * usability question
        * how much easier is it to do the analysis in Spark vs. ROOT
    * physics comparison
        * take established workflow
        * change basic code, like a cut
        * measure the time to change the cut
        * measure the time to produce a new physics plot with backgrounds and signal and everything
    * comparison: running both at 30 different operation points
        * question about normalization: lxbatch queue depth and number of running jobs vs. size of the Spark cluster
    * idea: Alexey: tune a Spark parameter: partitioning (see the sketch after this list)
        * first tuning resulted in 15% performance improvement
            * reducing the number of partitions was beneficial
            * we can change the partition size without changes in the code
        * difference: in ROOT the partitioning is defined by the analyst (which for loop runs over which file list), while in Spark it is done by the system, independently of the analyst
    * idea: Igor: cost per million events:
        * hardware is a good first step
        * need to track these properties when we run, to be able to normalize running time by the amount of hardware
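
A hedged Spark/Scala sketch of the dummy counting metric and of the partition tuning; the dataset path and the 64-partition choice are made-up example values, not the actual tuning result:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("CountAndTune").getOrCreate()
val df = spark.read.parquet("hdfs:///path/to/dataset.parquet")

// dummy analysis: just count events, the Spark-side analog of a trivial ROOT loop
var t0 = System.nanoTime()
val n1 = df.count()
println(f"$n1 events in ${(System.nanoTime() - t0) / 1e9}%.1f s " +
        s"with ${df.rdd.getNumPartitions} partitions")

// partition tuning: coalesce changes the partitioning without touching the
// analysis code itself; fewer partitions gave ~15% in the first test
val tuned = df.coalesce(64)   // 64 is an arbitrary example value
t0 = System.nanoTime()
val n2 = tuned.count()
println(f"$n2 events in ${(System.nanoTime() - t0) / 1e9}%.1f s with 64 partitions")
```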

* Overleaf for CHEP paper
    * OLI will set it up
    * please use versions often
    * if you use it in git mode, use versions as well
        * set snapshots
        * sometimes there is confusion when you use the mixed mode
        * git commits are not meaningful

* CMSSW writing out AVRO files
    * a working example exists
    * started with complex structure from CMSSW event (event properties, jet collections, jets with data members)
    * a bit fragile, but it is a proof of principle
        * the user defines a JSON file that describes the data structures; YAML is being considered
    * next:
        * clean up the repository and provide a README
        * provide a CMSSW wrapper (motivated by the fragility)
        * a stepping stone to a less fragile solution
    * right now it is an EDAnalyzer, could be made into an OutputModule
        * talk to Chris Jones
    * the AVRO C library needs to be linked into CMSSW
    * same could be used in principle for HDF5
    * will be the starting point for the discussion of restructuring the workflow after CHEP (see the Spark-side reading sketch below)
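
On the Spark side, a sketch of how such Avro output could be read back, assuming the Databricks spark-avro package; the path and the package version are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ReadCmsswAvro").getOrCreate()

// needs the external spark-avro package, e.g.
//   spark-shell --packages com.databricks:spark-avro_2.11:3.0.0
val events = spark.read
  .format("com.databricks.spark.avro")
  .load("hdfs:///path/to/cmssw_output/*.avro")

events.printSchema()   // event properties, jet collections, jet data members
println(s"events: ${events.count()}")
```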


* we will write a separate paper with the full details, because the worry is that the CHEP paper already contains too much information

* update on CERN workflow
    * nothing yet; the C++ code work needs to happen first

* next meeting, next week, September 14
