CMS Big Data Science Project
DIR/ Fish Tank-WH13X - Wilson Hall 13th fl Crossover
CERN room:
Instructions to create a lightweight CERN account to join the meeting via Vidyo:
If that is not possible, people can join the meeting by phone; the call-in numbers are here:
The meeting ID is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
## 20th Big Data meeting - 161110
* attendance: Alexey, JimP, Matteo
* Supercomputing
  * getting ready for SC; the poster is almost done
    * good numbers on Edison for the full datasets
    * the final version will be uploaded to Slack today
  * worked on the slides for the demo
    * mostly the Grace Hopper talk, with added information on orchestration
  * demo
    * show slides
    * get interactive time on Edison
    * run a job on Edison
    * create a DataFrame on the fly
* papers
  * CHEP paper
    * deadline: January 27
    * will include the results presented, maybe more
  * HPDC paper
    * deadline: January/February
  * also targeting another workshop: HP and Big Data Computing
    * very relevant to the NERSC workflow
    * deadline: January/February
  * need to decide which work goes to which conference
    * either split the results or submit to only one
* started the application for a 2017 allocation at NERSC
  * Saba wants to finish it this week
* discussion about how to continue after CHEP
* new member
  * VictorK from U Iowa
    * in the last year of his graduate studies
    * wants to get a second PhD, in computer science
    * especially interested in Scala
* a few options to read ROOT files into Spark were considered -> settled
  * settled on Java->Spark (VictorK is doing a lot of the work)
  * the existing code is very well developed
    * reads ROOT files directly
    * a pure Java reimplementation of ROOT I/O, developed a long time ago
    * needs a few tweaks to be modernized
  * Spark DataFrame interface
    * the DataFrame is a view into the ROOT file: ntuples or classes (a nested schema of arrays and structs)
    * VictorK is getting it up to speed for flat and two-level structures
  * being pushed to Maven Central, with Maven coordinates
    * because it is Java, nothing needs to be installed
    * Spark downloads everything, including dependencies, and inserts it into the session
  * on the Diana-HEP GitHub
    * root4j
    * spark-root
  * ROOT files opened with Java can be used equally from Scala Spark and PySpark (see the sketch below)
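A minimal sketch of what this looks like once the package is on Maven Central. The Maven coordinates, data-source name, and file path below are illustrative assumptions, not confirmed release values; check the Diana-HEP spark-root repository for the actual ones:

```scala
// Launch the shell so that Spark pulls the package and its dependencies
// straight from Maven Central (the version string is a placeholder):
//   spark-shell --packages "org.diana-hep:spark-root_2.11:0.1.0"

// Read a ROOT file through the spark-root DataFrame data source.
// The path is hypothetical; any ROOT ntuple would do.
val df = spark.read
  .format("org.dianahep.sparkroot")
  .load("hdfs:///data/example.root")

df.printSchema()  // the nested schema of arrays and structs noted above
df.show(5)
```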
* NERSC
  * will continue on the HDF5 path and move to MPI
  * study how file size correlates with performance on the NERSC file system
  * then maybe check out the Java ROOT reader
* the analysis has two parts
  * producing ntuples (see the sketch after this list)
    * read in the ROOT files with spark-root
    * write out Parquet or flat ROOT; the writer still needs to be developed
  * reading the ntuples and producing plots
    * should be done in PySpark (with the DataFrame/SQL API, performance is the same in Scala and Python)
    * alternatively, just read the ROOT files directly and make the plots
      * Matteo prefers this
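A sketch of the ntuple-producing step under the same assumptions as above (the spark-root data-source name, the paths, the job name, and the branch names are all hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object MakeNtuples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("make-ntuples").getOrCreate()

    // Part one: read the ROOT files through spark-root ...
    val events = spark.read
      .format("org.dianahep.sparkroot")    // assumed data-source name
      .load("hdfs:///data/events.root")    // hypothetical input path

    // ... pick out the branches the plots will need (illustrative names)
    // and persist them as a flat Parquet ntuple. A flat-ROOT writer does
    // not exist yet and would need to be developed.
    events
      .select("muon_pt", "muon_eta", "muon_phi")
      .write.parquet("hdfs:///ntuples/muons.parquet")

    spark.stop()
  }
}
```

Part two then reads these Parquet ntuples back, in PySpark or Scala alike, and fills the plots.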
* Scala vs. Python
  * if we don't have to introduce Scala, that is an advantage for adoption
  * working on RDDs, Scala performance is significantly better than Python's
  * working on DataFrames (spark-root), Python and Scala are comparable in performance
    * Python only sends a description of the task to Scala
    * PySpark does not have Datasets
  * you give Spark an AST (abstract syntax tree) of an expression, and Spark optimizes the work plan (the same way a database optimizes SQL); see the illustration below
  * we might not be able to convert Cristina's code to ASTs using PySpark
    * that is the reason to keep the two-part structure
    * it would also demonstrate that flat ntuples in PySpark are fast
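To make the AST point concrete, a small Scala illustration in the spark-shell (where `spark` is predefined), reusing the hypothetical flat ntuple from the sketch above:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical flat ntuple produced by the first analysis step.
val df = spark.read.parquet("hdfs:///ntuples/muons.parquet")

// DataFrame route: col("muon_pt") > 20 builds an expression tree, not a
// closure. Catalyst inspects and optimizes it before anything runs, which
// is why Python and Scala perform the same here: PySpark only ships this
// description of the work to the JVM.
val highPt = df.filter(col("muon_pt") > 20)
highPt.explain()  // prints the optimized work plan

// RDD route: the lambda is an opaque JVM closure that Spark must run
// as-is; the PySpark equivalent would also serialize every row out to a
// Python worker, which is where the large Scala-vs-Python gap comes from.
val highPtRdd = df.rdd.filter(_.getAs[Double]("muon_pt") > 20)
```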
* next steps
* JimP and VictorK: get the root data frame reader working, root files from both HDFS and EOS (through xrootd protocol)
* replace AvroReader with RootReader in analysis code
* JimP is meeting with Luca at CERN to get it integrated at CERN
* question if security is integrated in the xrootd client in root4j, might be only able to do local xrootd
* JimP is working with another analysis use case, Mark is doing a WMass measurement in CMS
* he does not want to look at Scala
* Matteo and Alexey will start using the RootReader
* action items
  * add VictorK to the meeting invites
* next meeting
  * Monday, November 21st, 2 PM
  * Saba will give a Supercomputing update