CMS Big Data Science Project
FNAL: DIR / Black Hole - WH2NW (Wilson Hall 2nd floor North West)
CERN: 31-S-028
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If not possible, people can join the meeting by phone; the call-in numbers are here:
The meeting id is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
* attendance: Luca, Kazper, Vagg, Alexey, JimP, JimK, OLI
* ROOT from EOS in Hadoop
* a meeting with the engineers from other CERN-IT departments finished just before this meeting
* getting close to a first working implementation that can read ROOT files from EOS in Spark (a sketch of what this could look like follows below)
* estimate: 4 weeks
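For reference, a minimal sketch of what reading a ROOT file from EOS in Spark could look like. Everything here is an assumption rather than the team's actual code: the spark-root package coordinates and DataSource name, the XRootD/EOS connector resolving root:// URIs, and the EOS path.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming spark-root is on the classpath (e.g. submitted with
// --packages org.diana-hep:spark-root_2.11:<version>) and that an XRootD/EOS
// Hadoop connector makes root:// URIs resolvable. The EOS path is hypothetical.
object ReadRootFromEos {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("root-from-eos")
      .getOrCreate()

    // spark-root exposes ROOT TTrees as DataFrames through a Spark DataSource.
    val events = spark.read
      .format("org.dianahep.sparkroot")
      .load("root://eospublic.cern.ch//eos/path/to/dataset.root")

    events.printSchema()
    println(s"number of events: ${events.count()}")

    spark.stop()
  }
}
```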
* Instructions to run the reduction example
* instructions from Vagg, code from Viktor
* run from our VM
* Performance metrics discussion
* document on webpage https://cms-big-data.github.io
* add or change pages through this github repo: https://github.com/cms-big-data/cms-big-data.github.io-source (pages are in content/pages)
* the site uses Travis CI to build the pages (it takes a few minutes after changes have been pushed)
* Application metrics
* primary metric for the reduction facility:
* How quickly can I reduce how many events? (a back-of-envelope sketch follows this list)
* depends on
* reduction factor
* size per event
* how much of the event is accessed during reduction (to make decision (skimming) and also to pass on to output (slimming))
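A back-of-envelope model of the primary metric above: events, times bytes per event, times the fraction of each event actually accessed, divided by the aggregate read bandwidth. All names and numbers are illustrative, not measured values from the project.

```scala
// Back-of-envelope estimate for "how quickly can I reduce how many events".
object ReductionEstimate {
  /** Estimated wall time (seconds) to reduce nEvents, given how much of each
    * event is read (for skimming decisions and the slimmed output) and the
    * aggregate read bandwidth of the system in bytes/s. */
  def reductionTimeSeconds(nEvents: Long,
                           bytesPerEvent: Double,
                           fractionAccessed: Double,
                           bandwidthBytesPerSec: Double): Double =
    nEvents * bytesPerEvent * fractionAccessed / bandwidthBytesPerSec

  def main(args: Array[String]): Unit = {
    // Illustrative numbers: 1e9 events, 100 kB/event, 10% of each event
    // touched, 10 GB/s aggregate read bandwidth.
    val t = reductionTimeSeconds(1000000000L, 100e3, 0.1, 10e9)
    println(f"estimated reduction time: $t%.0f s (~${t / 3600}%.1f h)")
  }
}
```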
* System metrics: always aiming for a root cause analysis
* memory usage and caching strategy
* I/O metrics
* Spark built-in metrics (a listener sketch follows this list)
* CPU time of all executors
* time spent in garbage collection, time spent in serialization
* from HDFS you get the number of rows and the amount of data read
* measure network traffic, important for reading from EOS
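A sketch of collecting the Spark built-in metrics mentioned above (executor CPU time, GC time, serialization time, HDFS bytes and records read) with a SparkListener. The listener API and the TaskMetrics fields are standard Spark 2.x; aggregating them into global counters is just one possible approach.

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulates task-level metrics across the whole application.
class IoMetricsListener extends SparkListener {
  val cpuTimeNs   = new AtomicLong(0L) // executor CPU time, nanoseconds
  val gcTimeMs    = new AtomicLong(0L) // JVM garbage-collection time, ms
  val serTimeMs   = new AtomicLong(0L) // result serialization time, ms
  val bytesRead   = new AtomicLong(0L) // bytes read from the input source (e.g. HDFS)
  val recordsRead = new AtomicLong(0L) // records (rows) read from the input source

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) { // metrics can be missing for failed tasks
      cpuTimeNs.addAndGet(m.executorCpuTime)
      gcTimeMs.addAndGet(m.jvmGCTime)
      serTimeMs.addAndGet(m.resultSerializationTime)
      bytesRead.addAndGet(m.inputMetrics.bytesRead)
      recordsRead.addAndGet(m.inputMetrics.recordsRead)
    }
  }
}

// Usage: spark.sparkContext.addSparkListener(new IoMetricsListener)
```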
* Comment from Luca: reading Parquet from Spark is normally CPU-bound
* JimP: compression? can you try without compression? (a minimal way to test this is sketched below)
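One minimal way to test JimP's suggestion: write an uncompressed Parquet copy of the data and compare read-side CPU time against the compressed copy. The paths below are hypothetical; the compression option and codec setting are standard Spark SQL.

```scala
import org.apache.spark.sql.SparkSession

object UncompressedParquetCopy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compression-test").getOrCreate()

    val df = spark.read.parquet("hdfs:///data/events.parquet") // hypothetical input

    // Per-write codec override: "uncompressed" disables compression for this copy.
    df.write
      .option("compression", "uncompressed")
      .parquet("hdfs:///data/events_uncompressed.parquet")

    // Alternatively, change the session-wide default codec:
    // spark.conf.set("spark.sql.parquet.compression.codec", "uncompressed")

    spark.stop()
  }
}
```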
* discussion about comparing ROOT and Spark
* very difficult, as we saw from the CHEP paper exercise
* JimP has numbers comparing C++ ROOT reading and spark-root reading
* if the C++ ROOT reading is done correctly, ROOT is 4 times faster than spark-root (the C++ code and the Java code do exactly the same thing)
* we could invite the ROOT team to help us optimize the ROOT workflow
* todo list
* add metrics to the spark-root reader, like the metrics you get from Parquet
* round table items
* Saba's and JimK's HDF5 investigations were presented at
* the root4j and spark-root repositories are not cleanly separated; JimP has a student who will work on a branch to refactor the ROOT-I/O-specific components of spark-root and move them into root4j
* JimK and Saba will get a summer student to do the alternative implementation (numpy+pandas+mpi); it will also use the tools coming out of the LDRD to convert into HDF5
* next meeting: June 21
* Plan to ask Marc Paterno (FNAL) to present his LDRD (Lab Directed R&D) project on optimally converting ROOT files into HDF5 format; JimK will talk to Marc
* Luca will ask whether Jakob Blomer and other SFT people would like to join
* JimP and OLI will be at CERN