CMS Big Data Science Project

US/Central
Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
Description

FNAL room: TaberNAcle - WH5E - Wilson Hall, 5th floor East

CERN room:

 

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If not possible, people can join the meeting by the phone, call-in numbers are here:

The meeting id is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

## Meeting

attendance: Illia, Zbigniev, Luca, Maria, Alexey, Bo, JimK, Nhan, Matteo, JimP, OLI

* Princeton thrust
    * code is not yet visible from outside
    * plan to repeat it at CERN
        * afternoon work to make it work at CERN
        * read root inputs on the fly
    * by the end of the week, documentation should be figured out
* NERSC thrust
    * question about documentation, coherent flow was not visible
    * need quick response time on questions to implement Princeton workflow at NERSC
    * not sure how much modification of the analysis code will be needed
        * conversion from a row-wise to a column-wise workflow
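The row-wise to column-wise conversion mentioned above can be sketched in plain Python; the event fields and function name here are illustrative stand-ins, not the project's actual analysis code:

```python
# Convert a row-wise event list (one dict per event, as read from a ROOT
# tree) into a column-wise layout (one list per branch), which is the
# shape Spark/Parquet-style processing prefers. Branch names are made up.
def rows_to_columns(events):
    columns = {}
    for event in events:
        for branch, value in event.items():
            columns.setdefault(branch, []).append(value)
    return columns

events = [
    {"run": 1, "nMuons": 2, "met": 41.5},
    {"run": 1, "nMuons": 0, "met": 12.3},
]
cols = rows_to_columns(events)
# cols["nMuons"] == [2, 0]; cols["met"] == [41.5, 12.3]
```

In the real workflow this reshaping would happen once at conversion time, so the analysis code downstream only ever sees columns.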
* NoSQL
    * JimP gave Igor some Bacon AVRO files to test and work with
    * Igor is talking to JimP
* Plans for CHEP
    * paper done by beginning of October
    * have 1 month
* Grace Hopper
    * 3 weeks after CHEP
    * needs to have a paper with pretty good results and performance measurements
    * by the end of September need to have the coding done
* SC is November 20th

* Metrics
    * working at Princeton
    * technical comparison
        * dummy root analysis (counting events) compared to spark (counting events)
        * full analysis code (use TTreeCache vs. Spark cached)
    * usability question
        * how much easier is it to do the analysis in spark vs. root
    * physics comparison
        * take established workflow
        * change basic code, like a cut
        * measure the time to change the cut
        * measure the time to produce a new physics plot with backgrounds and signal and everything
    * comparison: running both at 30 different operation points
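The technical comparison's timing could be collected with a small harness like the following sketch; the two counting functions are hypothetical stand-ins for the real ROOT and Spark paths:

```python
import time

def time_call(fn, *args):
    """Return (result, wall-clock seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Stand-ins for the two dummy analyses (counting events); the real
# comparison would count via a ROOT TTree on one side and a Spark
# DataFrame/RDD on the other.
def count_events_rootlike(events):
    return sum(1 for _ in events)

def count_events_sparklike(events):
    return len(list(events))

events = range(100000)
n_root, t_root = time_call(count_events_rootlike, events)
n_spark, t_spark = time_call(count_events_sparklike, events)
assert n_root == n_spark  # both paths must agree before timings mean anything
```

The same harness would then be re-run at each of the 30 operation points to build the comparison curve.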


* Metrics
    * technical comparison of the different steps of the workflow: time and memory
        * JimK: problems with lazy evaluation
        * JimP: breaking it down into trivial tasks
            * dummy root analysis (counting events) compared to spark (counting events)
            * full analysis code (use TTreeCache vs. Spark cached)
    * usability comparison, user experience
        * changing one cut and producing the same physics plot (background and data)
        * changing a more significant piece of the analysis and comparing to produce a physics plot
        * optimization problem of either a cut or a plot (order of backgrounds) 
        * need objective and subjective metrics, also include how much time it takes to change the code
    * JimP: familiar with root, batch systems, spark, scala ➜ but no idea of the workflow
    * Nhan: different, familiar with analysis, root, batch systems; not familiar with Spark
    * Saba: runs Spark at NERSC
    * CERN: measure how hard it is to move the workflow to CERN
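The "change a cut, reproduce the plot" metric could be instrumented along these lines; the cut variable, threshold values, and function name are made up for illustration:

```python
# Parameterize the cut so changing it is a one-line edit; the usability
# metric is then the wall-clock (and human) time from that edit to a
# refreshed physics plot. "met" is a placeholder branch name.
def select(events, met_cut):
    return [e for e in events if e["met"] > met_cut]

events = [{"met": m} for m in (10.0, 25.0, 40.0, 55.0)]
baseline = select(events, met_cut=30.0)   # 2 events pass
tightened = select(events, met_cut=50.0)  # 1 event passes
```

Both the ROOT and Spark versions of the workflow would expose the same knob, so the objective timing and the subjective "how hard was the edit" measurements stay comparable.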

* Getting the data somewhere to analyze it is the hardest thing
    * CERN is thinking about how to improve and simplify data access from, for example, Spark/Hadoop
        * can mount EOS
        * can even use CERNBox
        * working on making the hdfs command understand xrootd ➜ can this be open sourced? JimP is very interested, a fundamental building block

* Next meeting, next week

    • 10:00–10:05
      News 5m
      Speakers: Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))

      Plan for today:

       

      1. quick round table about coordination issues between different thrusts

      2. NoSQL: Talk about plans next meeting? Anything needed for next steps?

      3. Metrics discussion!