CMS Big Data Science Project

Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
Description

PPD/ Round Table-WH11SE - Wilson Hall 11th fl South East

CERN room:

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If that is not possible, people can join the meeting by phone; the call-in numbers are here:

The meeting ID is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

## 170222 - Big Data Meeting

* attendance: Alexey, Bo, Illia, Ian, Luca (and other CERN people), Viktor, Matteo, Jim, OLI

* jupyter notebook
    * don't use the notebook's built-in option to open a second notebook file; instead, start a second pyspark notebook instance from the shell

* scala
    * needs compilation/packaging into a jar (see the build sketch after this list)
    * Luca's suggestion: create a VM with access to the analytix cluster where we can install all the software we need
        * rely on what is already on CVMFS from the SWAN project
        * Luca will work on providing the VM
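
As a concrete reference for the packaging discussion above, here is a minimal `build.sbt` sketch for bundling analysis code together with spark-root into a jar. The project name and the Scala, Spark, and spark-root versions below are assumptions and should be matched to what analytix actually runs:

```scala
// build.sbt -- minimal sketch; names and versions are assumptions,
// adjust them to the Scala/Spark installed on the analytix cluster
name := "cms-bigdata-analysis"
version := "0.1.0"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // "provided": the cluster supplies Spark at runtime, so keep it out of the jar
  "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided",
  // spark-root reads ROOT files into Spark DataFrames
  // (version is an assumption; check Maven Central for the current one)
  "org.diana-hep" % "spark-root_2.11" % "0.1.11"
)
```

Running `sbt package` then produces the jar, which can be shipped to the cluster with `spark-submit --jars`.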

* testing with hardware provided by Intel
    * provided by Intel specifically to perform benchmarks
    * dedicated capacity, available for stress tests
    * includes various software optimizations
    * otherwise no different from analytix
    * copied 1.2 TB of data from the CMS open data
    * Viktor is testing

* performance metrics / timing
    * the Spark history server could be a possibility
    * for now, Viktor used a Spark listener to get time stamps (see the sketch after this list)
        * measures the time every executor takes
        * need to build a jar and include it ➜ Viktor's listener implementation is very specific
        * maybe we can have a central jar available for everyone to use; that would require extending and generalizing Viktor's implementation ➜ Jim and Viktor?
    * Viktor will send the location of the repository with his code
    * other performance metrics that we should measure
        * execution time, on executor and driver side
        * CPU
        * parallelism of execution
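
Since a generic listener keeps coming up, here is a minimal sketch of what such a central implementation could look like. This is not Viktor's code; it is a generic `SparkListener` that prints per-task and per-stage wall-clock times (a real version would aggregate the numbers or write them to a file):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

// Generic timing listener: reports how long each task and stage took.
class TimingListener extends SparkListener {

  // Called once per finished task; duration = finishTime - launchTime (ms)
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    println(s"task ${info.taskId} on executor ${info.executorId}: ${info.duration} ms")
  }

  // Called once per finished stage; the timestamps are optional until completion
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val si = stage.stageInfo
    val elapsed = for {
      start <- si.submissionTime
      end   <- si.completionTime
    } yield end - start
    println(s"stage ${si.stageId} (${si.name}): ${elapsed.getOrElse(-1L)} ms")
  }
}
```

It can be registered programmatically with `sc.addSparkListener(new TimingListener)`, or, once packaged into a jar on the driver classpath, with `--conf spark.extraListeners=TimingListener`.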

* discussing the plan (Google Doc)
    * SWAN is proposed as the way to use the analytix cluster, because SWAN is already working on integrating all the components
    * thrust 1
        * "make plots"
    * thrust 2
        * "perform fits"
        * possible setup (see the reduction sketch after this list):
            * starting from ROOT files
            * 3 orders of magnitude of reduction into data frames
            * store as numpy arrays
            * read them back in with numpy-to-ROOT conversion
            * then use the CMS fitting infrastructure
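
For illustration, a minimal sketch of the reduction step in this setup, reading ROOT files with spark-root and writing out a reduced table (here as Parquet; the hand-off to numpy and the CMS fitting infrastructure would happen downstream). The file path and column names are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.dianahep.sparkroot._ // adds .root(...) to the DataFrame reader

object ReductionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("root-reduction").getOrCreate()

    // hypothetical input: a ROOT file copied from the CMS open data
    val df = spark.sqlContext.read.root("hdfs:///path/to/opendata.root")
    df.printSchema() // inspect the real branch names first

    // hypothetical reduction: apply a simple cut, keep two branches
    val reduced = df.filter(df("muon_pt") > 20.0)
                    .select("muon_pt", "muon_eta")

    reduced.write.parquet("hdfs:///path/to/reduced")
    spark.stop()
  }
}
```
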
* action items
    * Luca will create VM
    * Need a generic method/code base to measure performance (keyword: Spark listener) ➜ Jim and Viktor?

* plan:
    * Have Scala code in 2 weeks to start performance investigations with Illia ➜ Matteo and OLI

* Question: conversion vs. direct ROOT reading
    * The opinion is not to pursue conversion
    * How did we come to this decision? ➜ convenience
        * conversion wouldn't see widespread adoption
    * Are we sure that by not converting we still get the performance we need?
        * argument: using converted data in Spark is a lot faster (it can use Spark's built-in parallelization)
    * Come back to this topic next time

* Next meeting in 2 weeks, March 8th

    • 16:00-16:05
      News (5m)
      Speakers: Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))

      analytix tutorial overview:

       

      • setup: https://github.com/diana-hep/spark-root/blob/master/UserGuideSetupAnalytix.md
      • Jupyter notebook + histogrammer: https://github.com/diana-hep/spark-root/blob/master/ipynb/publicCMSMuonia_exampleAnalysis_wROOT.ipynb
      • Scala: https://github.com/diana-hep/spark-root/blob/master/UserGuideScala.md

      plan for today

      • Discuss analytix usage and the need for Scala compilation/packaging
        • Luca suggested setting up a VM with all the needed software and access to the cluster
      • Next steps
    • 16:05-17:00