CMS Big Data Science Project

US/Central
Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
Description

FNAL room: DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East

CERN room:

 

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If that is not possible, people can join the meeting by phone; the call-in numbers are here:

The meeting id is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

* attendance: Illia, Cristina, Saba, JimK, Igor, Luca, Zigniew, Alexey, Bo, OLI

* news
    * Today it is all about looking at the full analysis pass and the documentation

* pass through code at Princeton
    * https://slack-files.com/T0LRAT4HF-F21RA250Q-5460548b47
    * need to complete the documentation ➜ making plots in the end is not yet complete
    * need to decide how to do fits at the end

* discussion: how easy is it to reproduce this at CERN?
    * on-the-fly conversion vs. conversion and store AVRO in hdfs at CERN
    * 3 possibilities
* Intel needs to replicate the workflow in their facilities

* questions:
    * Why AVRO and not parquet? ➜ First question to ask after CHEP paper
    * In the code, there is a lot of usage of rdd’s, why not use data frames immediately? ➜ Good question, it was easier to start with rdd’s.
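
A plain-Python analogy for the rdd vs. data frame question above (a sketch only, no Spark involved; the field names `run`, `muon_pt` and `met` are made up for illustration). An rdd of events behaves like a list of per-event records, while a data frame behaves more like one array per field, so a selection only needs to read the columns it uses:

```python
# Plain-Python analogy only: this does not use Spark. Field names are hypothetical.

# Row-oriented: one record per event (roughly how an rdd of events looks to user code).
events = [
    {"run": 1, "muon_pt": 25.0, "met": 40.0},
    {"run": 1, "muon_pt": 31.5, "met": 12.0},
    {"run": 2, "muon_pt": 18.2, "met": 55.5},
]


def to_columns(rows):
    """Pivot a list of records into one list per field (a data-frame-like layout)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(events)

# Row-wise selection walks over every full record.
high_met_rows = [e["muon_pt"] for e in events if e["met"] > 30.0]

# Column-wise selection reads only the two fields it needs.
high_met_cols = [
    pt for pt, met in zip(columns["muon_pt"], columns["met"]) if met > 30.0
]

# Both layouts give the same physics answer; only the access pattern differs.
assert high_met_rows == high_met_cols
```

Both selections return the same values; the point is only that the column layout lets a query skip fields it does not use, which is one reason to consider moving to data frames early.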

* NERSC
    * use the HDF5 files directly and not go through rdd’s
        * all ROOT files are at NERSC
    * working with the HDF5Spark team to understand why the Scala API is not returning the correct data fields
    * current bottleneck: reading data frames from HDF5 in Scala
    * Python converter from ROOT to HDF5 is doing just fine but is slow
        * could do in C++
        * but the conversion has only been done once for now, can optimize later
        * don’t want to start the conversion of everything till the full analysis pass on one file is complete and successful
    * from analysis perspective, all looks encouraging

* Grace Hopper conference 
    * presentation is on October 21
    * Saba will prepare a short timeline of which ingredients for the presentation are needed by when
        * especially the physics part needs help from all of us
    * target is to have a full working workflow at NERSC by September 21

* goal:
    * analysis code
        * make documentation available on TWiki or similar public space (Slack does not seem to work, links do not work)
        * complete documentation, especially in the plotting part at the end
        * everybody needs to start playing with the code!
    * Princeton workflow
        * replicate the workflow at any facility
    * NERSC: 
        * complete HDF5 reading from Scala
    * Discussion in 2 weeks: how do we compare Spark with ROOT? Metrics?
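
As input to the metrics discussion, one possible starting point is plain event throughput (events per second) for the same selection run in each framework. The sketch below is an assumption, not an agreed metric: `measure_throughput` and `toy_selection` are hypothetical names, and the toy events stand in for real data.

```python
import time


def measure_throughput(process_event, events, repeats=3):
    """Return the best events-per-second rate over several timed passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for event in events:
            process_event(event)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading on very coarse clocks.
    return len(events) / best if best > 0 else float("inf")


def toy_selection(event):
    """Hypothetical stand-in for a per-event analysis cut."""
    return event["met"] > 30.0


# Toy input; a real comparison would run the same cut over the same files.
toy_events = [{"met": float(i % 100)} for i in range(10_000)]
rate = measure_throughput(toy_selection, toy_events)
print(f"{rate:.0f} events/s")
```

Taking the best of several passes reduces noise from caching and machine load; a fair comparison would also need to state whether file reading and conversion time are included.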
