CMS Big Data Science Project

US/Central
Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
Description

DIR/ Fish Tank-WH13X - Wilson Hall 13th fl Crossover

 

CERN room:

 

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If that is not possible, people can join the meeting by phone; the call-in numbers are here:

The meeting id is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

## 20th Big Data meeting - 161110

* attendance: Alexey, JimP, Matteo, 

* Super Computing
    * getting ready for SC; the poster is nearly done
        * good numbers for Edison for the full datasets
        * the final version will be uploaded to Slack today
    * worked on the slides for the demo
        * mostly the talk from Grace Hopper, with added information on orchestration
    * demo
        * show slides
        * get interactive time on Edison
        * run job on Edison
        * create data frame on the fly

* papers
    * CHEP paper
        * deadline January 27
        * will include the results presented, and possibly more
    * HPDC paper
        * deadline January/February
    * also targeting another workshop: HP and Big Data Computing
        * very relevant to NERSC workflow
        * deadline January/February
    * need to decide what work goes to what conference
        * either split or decide to only do one

* started application for allocation at NERSC for 2017
    * Saba wants to finish it this week

* discussion about how to continue after CHEP
    * new member
        * VictorK from U Iowa
        * in the last year of his graduate-student time
        * wants to get a 2nd PhD in computer science
        * especially interested in Scala
    * several options to read ROOT files into Spark were considered -> settled
        * settled on Java->Spark (VictorK is doing a lot of the work)
        * the existing code is very well developed
        * reads ROOT files directly
        * pure Java reimplementation of ROOT I/O, developed a long time ago
            * needs a few tweaks to be modernized
            * Spark DataFrame interface
            * the data frame is a view into the ROOT file, ntuple or classes (nested schema of arrays and structs)
    * VictorK is getting it up to speed for flat and 2-level structures
        * it is being pushed to Maven Central
        * Maven coordinates
        * because it is Java, nothing needs to be installed locally
        * Spark downloads everything, including dependencies, and inserts it into the session
        * Diana-HEP GitHub
            * root4j
            * spark-root
    * ROOT files opened with Java can be used equally from Scala Spark and PySpark
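The read path described above might look like the following in Scala Spark. This is a sketch only: the Maven coordinates, package name, and file path are illustrative assumptions, not details confirmed in the minutes.

```scala
// Launch a Spark shell that pulls spark-root (and its root4j dependency)
// from Maven Central -- coordinates/version here are illustrative:
//   spark-shell --packages org.diana-hep:spark-root_2.11:0.1.x

// spark-root exposes ROOT files as a Spark DataFrame source
import org.dianahep.sparkroot._

// The resulting DataFrame is a view into the ROOT file's ntuple/classes
// (a nested schema of arrays and structs); no local ROOT install is needed
val df = spark.sqlContext.read.root("hdfs:///path/to/events.root")
df.printSchema()
```

Because everything is plain Java/Scala on the classpath, the same file can then be queried from either the Scala shell or pyspark.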

* NERSC
    * will continue on HDF5 path and go MPI
    * study file size correlation with performance on the NERSC file system
    * then maybe check out the Java ROOT reader

* analysis has two parts
    * producing ntuples
        * read in ROOT files with spark-root
        * write out Parquet or flat ROOT; this still needs to be developed
    * reading them and producing plots
        * should do this in pyspark (should use mysql; performance is the same in Scala and Python)
* alternatively, we could just read in the ROOT files and make the plots directly
    * Matteo prefers this
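The two-part structure above can be sketched in Scala Spark as follows. Column names and paths are hypothetical, and the second part could equally be run from pyspark, since DataFrame performance is comparable.

```scala
import org.dianahep.sparkroot._

// Part 1: ntuple production -- read ROOT with spark-root, write a flat
// Parquet ntuple (the flat-ROOT output option still needs to be developed)
val events = spark.sqlContext.read.root("hdfs:///cms/data/events.root")
events.select("muon_pt", "muon_eta")
      .write.parquet("hdfs:///cms/ntuples/muons.parquet")

// Part 2: read the flat ntuple back and compute histogram contents
// for plotting (assumes muon_pt is stored as a double)
val ntuple = spark.read.parquet("hdfs:///cms/ntuples/muons.parquet")
val (binEdges, counts) = ntuple.select("muon_pt")
                               .rdd.map(_.getDouble(0))
                               .histogram(50)
```

Splitting the work this way keeps the ROOT-reading step in one place while the plotting step only ever sees a flat columnar file.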
* Scala vs. Python 
    * if we don’t have to introduce Scala, it would be an advantage for adoption
    * if you work on RDDs, Scala performance is significantly better than Python's
    * if you work on data frames (spark-root), Python and Scala are comparable in performance
        * Python sends the description of the task to Scala
    * pyspark does not have Datasets
        * you give Spark an AST (abstract syntax tree) of an expression, and Spark optimizes the work plan (the same way a database optimizes SQL)
        * we might not be able to convert Cristina's code to ASTs using pyspark
        * that's the reason to keep the two-part structure
        * it would also demonstrate that a flat ntuple in pyspark is fast
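The RDD-vs-DataFrame distinction can be illustrated with a hypothetical cut, assuming a DataFrame `events` with a `muon_pt` column. In the RDD case the predicate is an opaque closure (and the pyspark equivalent pays a per-row Python round trip), while in the DataFrame case only an expression tree reaches the optimizer, which is why Python and Scala perform alike there.

```scala
import org.apache.spark.sql.functions.col

// RDD route: the predicate is an opaque JVM closure; Spark cannot look
// inside it to optimize, and in pyspark each row would be shipped to a
// Python worker to evaluate the lambda
val fastMuonsRdd = events.rdd.filter(_.getAs[Double]("muon_pt") > 25.0)

// DataFrame route: the predicate is an expression (an AST) that Spark's
// optimizer rewrites like a database optimizing SQL -- identically
// whether the expression was built from Scala or from Python
val fastMuonsDf = events.filter(col("muon_pt") > 25.0)
```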
* next steps
    * JimP and VictorK: get the ROOT data frame reader working, reading ROOT files from both HDFS and EOS (through the xrootd protocol)
        * replace AvroReader with RootReader in the analysis code
        * JimP is meeting with Luca at CERN to get it integrated there
        * open question whether security is integrated in the xrootd client in root4j; it might only be able to do local xrootd
    * JimP is working on another analysis use case: Mark is doing a W-mass measurement in CMS
        * he does not want to look at Scala
    * Matteo and Alexey will start using the RootReader

* action items
    * add VictorK to meeting invites

* next meeting
    * Monday, November 21st, 2 PM
    * Saba will give a SuperComputing update

    • 11:00-11:05  News (5m)
      Speakers: Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))