CMS Big Data Science Project

Timezone: Europe/Berlin

Location: 31/S-028 (CERN)

Conveners: Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
Description

FNAL: DIR/ Black Hole-WH2NW - Wilson Hall 2nd fl North West

CERN: 31-S-028

Instructions to create a light-weight CERN account to join the meeting via Vidyo are linked from the event page.

If that is not possible, people can join the meeting by phone; the call-in numbers are also linked from the event page.

The meeting ID is:

  • 10502145

* attendance: Luca, Kazper, Vagg, Alexey, JimP, JimK, OLI

* ROOT from EOS in Hadoop
    * finished a meeting with engineers from other CERN-IT departments just before this meeting
    * getting close to a first working implementation that can read ROOT files from EOS in Spark (a usage sketch follows this item)
    * estimate: 4 weeks
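A minimal sketch of what reading ROOT from EOS in Spark could look like once the connector is in place. This is not the actual implementation under development: the DataSource format name comes from the spark-root package, and the EOS host and file path are placeholders.

```scala
// Sketch: load a ROOT file from EOS into a Spark DataFrame with spark-root.
// Assumes the EOS/XRootD Hadoop connector discussed above is on the classpath;
// the root:// URL below is a placeholder.
import org.apache.spark.sql.SparkSession

object ReadRootFromEos {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-root-from-eos")
      .getOrCreate()

    // spark-root plugs into the standard DataSource API under this format name
    val df = spark.read
      .format("org.dianahep.sparkroot")
      .load("root://eos-host.cern.ch//eos/path/to/file.root") // placeholder

    df.printSchema()                   // TTree schema as inferred by spark-root
    println(s"events: ${df.count()}")
    spark.stop()
  }
}
```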
* Instructions to run reduction example
    * instructions from Vagg, code from Viktor
    * run from our VM (a minimal reduction sketch follows this item)
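For orientation, the core of such a reduction in Spark is a column selection (slimming) plus an event selection (skimming). The sketch below is hypothetical: the column names and output path are invented and will differ in the actual example from Vagg and Viktor.

```scala
// Hypothetical reduction step on a DataFrame df loaded with spark-root:
// keep a subset of branches (slimming) and apply an event cut (skimming).
// Column names and the output path are invented for illustration.
import org.apache.spark.sql.functions.col

val reduced = df
  .select("run", "event", "nMuon", "MET_pt") // slimming: keep few branches
  .filter(col("nMuon") > 0)                  // skimming: event selection
reduced.write.parquet("hdfs:///user/cmsbigdata/reduced") // placeholder path
```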
* Performance metrics discussion
    * document on webpage https://cms-big-data.github.io
        * add or change pages through this github repo: https://github.com/cms-big-data/cms-big-data.github.io-source (pages are in content/pages)
        * this uses travis-ci to build the page (it takes a few minutes after changes have been pushed)
    * Application metrics
        * primary metrics for reduction facility: 
            * How quickly can I reduce how many events?
                * depends on
                    * reduction factor
                    * size per event
                    * how much of the event is accessed during reduction, both to make the selection decision (skimming) and to pass data on to the output (slimming)
    * System metrics: always aiming for a root cause analysis
        * memory usage and caching strategy
        * I/O metrics
        * Spark built-in metrics (a sketch for collecting these follows this item)
            * CPU time of all executors
            * time spent in garbage collection, time spent in serialization
            * from HDFS you get the number of rows and bytes read
        * measure network traffic, important for reading from EOS
    * Comment from Luca: reading Parquet from Spark is normally CPU-bound
        * JimP: could it be the compression? Can you try without compression? (a quick test is sketched after the comparison discussion below)
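As a starting point for the Spark built-in metrics listed above (and for getting the same numbers out of the spark-root reader, see the todo list below), a SparkListener can report per-task metrics. This is a sketch using the standard Spark 2.x listener API; aggregation and bookkeeping are left open.

```scala
// Sketch: print the built-in Spark task metrics mentioned above for every
// finished task (standard Spark 2.x listener API).
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class TaskMetricsPrinter extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(
        s"stage=${taskEnd.stageId} " +
        s"cpu=${m.executorCpuTime / 1e9}s " +       // executor CPU time (reported in ns)
        s"gc=${m.jvmGCTime}ms " +                   // time spent in garbage collection
        s"ser=${m.resultSerializationTime}ms " +    // result serialization time
        s"bytesRead=${m.inputMetrics.bytesRead} " + // e.g. read from HDFS
        s"recordsRead=${m.inputMetrics.recordsRead}")
    }
  }
}

// register on the running application:
// spark.sparkContext.addSparkListener(new TaskMetricsPrinter)
```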
* discussion about comparing ROOT and Spark
    * very difficult, as we saw from the CHEP paper exercise
    * JimP has numbers comparing C++ ROOT reading and spark-root reading
        * when the C++ ROOT code is written correctly, ROOT is 4 times faster than spark-root (the C++ and the Java code do exactly the same work)
    * we could invite the ROOT team to help us optimize the ROOT workflow
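A quick way to follow up on JimP's compression question from the metrics discussion above: write the same DataFrame with and without Parquet compression and compare read times. "compression" is a standard option of Spark's Parquet writer; the paths here are placeholders.

```scala
// Sketch: compare Parquet read speed with and without compression.
// spark.time prints the wall time of the enclosed block; paths are placeholders.
df.write.option("compression", "snappy").parquet("hdfs:///tmp/metrics_snappy")
df.write.option("compression", "none").parquet("hdfs:///tmp/metrics_none")

spark.time { spark.read.parquet("hdfs:///tmp/metrics_snappy").count() }
spark.time { spark.read.parquet("hdfs:///tmp/metrics_none").count() }
```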

* todo list
    * add metrics for the spark-root reader, like the metrics you get from Parquet

* round table items
    * Saba's and JimK's HDF5 investigations were presented at 
    * the root4j and spark-root repositories are not cleanly separated; JimP has a student who will work on a branch to refactor the ROOT-I/O-specific components of spark-root and move them into root4j
    * JimK and Saba will get a summer student to do the alternative implementation (numpy+pandas+mpi), which will also use the tools coming out of the LDRD to convert ROOT files into HDF5

* next meeting: June 21
    * Plan to ask Marc Paterno (FNAL) to present his LDRD (Lab-Directed R&D) project on optimally converting ROOT files into HDF5 format; JimK will talk to Marc
    * Luca will ask if Jacob Blomer and other SFT people would like to join
    * JimP and OLI will be at CERN

    • 16:00-16:10  News (10m)
      Speakers: Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
    • 16:10-17:00  Discussion: Performance Metrics (50m)