CMS Big Data Science Project

US/Central
Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
Description

FNAL room: DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East

CERN room:

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If that is not possible, people can join the meeting by phone; call-in numbers are here:

The meeting ID is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

## Meeting 160803

* attendance: Matteo, Alexey, Saba, JimK, Igor, Kacper, Luca, Bo, Illia, Cristina

* news
    * Introducing third thrust: NoSQL
        * Igor Mandrichenko will start investigating NoSQL databases and how our use case could be realized using them
        * We need to get him up to speed on the use case, help him get started with small test samples and show him how the analysis code works
    * Poster for SC
        * Was submitted in time on Friday, July 22nd, but in a very rudimentary form
        * It is not clear whether we will get a chance to update it
        * In general, we need to be better prepared, which means starting preparations earlier
        * To all: if you would like to submit abstracts or papers to conferences, please let us all know as early as possible so that we can help
    * CHEP
        * CHEP starts October 9th
        * By then we need to have a paper written that describes the use case realization in Spark and presents performance comparisons
        * The text of the paper will also be used to write a report for DOE
        * Should we start writing the paper? Overleaf? Yes
    * Mailing list
        * Is the Slack forum enough to reach everyone?
        * We could have a Google group in addition. Opinions?

* goals:
    * Princeton workflow
        * progress on full scale test ➜ we can get to plots
            * polishing the code
            * 2 things left, working with Jim and Alexey
            * running on smallest and largest sample
            * work on the plotting while Alexey is at FNAL
            * last test: persisted the input file in Spark; the first run takes 3 min, subsequent runs 14 seconds (see the sketch just below this list)
            * have to complete all samples
            * size of cluster: 4 service nodes and 6 worker nodes (138 virtual cores with hyper-threading enabled, 720 GB memory total)
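
A minimal sketch of the persist pattern from the test above, assuming Spark 2.x with the Databricks spark-avro package and the `spark` session predefined as in spark-shell; the input path is a placeholder, and the timings in the comments are the ones quoted from the test:

```scala
import org.apache.spark.storage.StorageLevel

// Placeholder path; the real test read our converted input sample.
val events = spark.read
  .format("com.databricks.spark.avro")   // spark-avro package
  .load("hdfs:///user/cms/events.avro")

// Keep the deserialized rows in memory (spilling to disk if needed)
// so later actions skip the expensive read/decode step.
events.persist(StorageLevel.MEMORY_AND_DISK)

events.count()  // first action pays the full cost (~3 min in the test)
events.count()  // later actions hit the cache (~14 s in the test)
```
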
        * understand what to do for the next two weeks
        * writing documentation that enables people to run the workflow is becoming crucial
        * plan to complete it by the next meeting
        * documentation on how to use a Jupyter notebook in a local browser and connect it through an ssh tunnel to the Princeton Spark instance (see the note just below this list)
            * a Spark-enabled Jupyter notebook is running on the cluster
            * an ssh tunnel connects to it
            * the local browser displays it
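
A hedged pointer for that documentation (hostname, user, and port here are placeholders, not the actual Princeton values): start the Spark-enabled notebook on the cluster, forward its port from the laptop with `ssh -N -L 8888:localhost:8888 user@spark-cluster.example.edu`, then open http://localhost:8888 in the local browser.
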
    * NERSC
        * the reader would be different
        * but the Scala analysis code is the same and can be reused
            * will use it from github
        * ROOT files have to be converted to HDF5 files
            * not yet ready to transfer all the files to NERSC
        * read HDF5 files directly into the Spark program
        * working on reading HDF5 into Spark
            * many things are not convenient and some do not work at all
        * work with the existing SparkHDF5 library is exhausted
            * not suited for our purposes
            * we have to write our own reader function
                * figuring out which layout is better suited for Spark
            * goal for next meeting: write a reader function (see the sketch just below this list)
        * scaling plan: ~1000 cores
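
A minimal sketch of the reader pattern under discussion, in Spark/Scala. `readEventsFromHdf5`, the `Event` case class, and the file paths are hypothetical illustrations, not the actual implementation; the real version would decode each file with an HDF5 binding:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative event record; the real schema would follow our ntuple layout.
case class Event(run: Int, lumi: Int, muonPt: Array[Float])

// Hypothetical placeholder for the reader we still have to write: decode
// one HDF5 file into an iterator of events using an HDF5 library.
def readEventsFromHdf5(path: String): Iterator[Event] =
  Iterator.empty  // stub

val spark = SparkSession.builder.appName("hdf5-reader").getOrCreate()
import spark.implicits._

// Parallelize the file list so each Spark task reads one file, then
// flatten the per-file iterators into one distributed Dataset.
val files = Seq("/scratch/hdf5/sample1.h5", "/scratch/hdf5/sample2.h5")  // placeholders
val events = spark.sparkContext
  .parallelize(files, numSlices = files.size)
  .flatMap(readEventsFromHdf5)
  .toDS()

println(events.count())
```
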
    * NoSQL
        * question whether a database approach is the right one for the use case
            * Igor: does not want to implement it if it does not make sense
            * discussion: it is then up to him to determine that
        * Matteo will forward the use case document
    * CERN
        * after the documentation, we will work on running the same workflow at CERN
        * inputs at CERN
            * ROOT-to-Avro conversion on the fly
            * storing Avro on HDFS (see the read sketch just below this list)
        * code:
            * can use the same code
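
A hedged sketch of reading the converted Avro back from HDFS into Spark, assuming the Databricks spark-avro package; the path and package version are placeholders:

```scala
// Launch with the spark-avro package on the classpath, e.g.
//   spark-shell --packages com.databricks:spark-avro_2.11:3.0.0
val events = spark.read
  .format("com.databricks.spark.avro")
  .load("hdfs:///cms/bigdata/events.avro")  // placeholder path

events.printSchema()
events.count()
```
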
    * after CHEP
        * right now we rely on input files that are custom
        * eventually we want to read the central CMS use case
        * redesigning the workflow completely
        * right now we are optimizing the bookkeeping needed when running the ROOT workflow
        * replace that step to read official CMS files within CMSSW
        * need to decide whether we want to just adapt the workflow or completely redesign it
        * the format question has to be solved; do we need an intermediate step?
        * right now, MINIAOD cannot be read directly
            * there is some C++ code that cannot be easily converted
        * end goal: 
        * at some time, we have to have discussion if Spark is the right technology
    * ROOT workflow performance numbers are not done yet; we are waiting for the documentation because we want to compare apples with apples
