CMS Big Data Science Project

Name: CMS Big Data Science Project
Start: 2016-08-17T10:00:00-05:00
End: 2016-08-17T11:00:00-05:00

Wednesday 17 Aug 2016, 10:00 → 11:00 US/Central

Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))

Description

FNAL room: DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East

CERN room:

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

https://account.cern.ch/account/Externals/

If that is not possible, you can join the meeting by phone; the call-in numbers are here:

http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone

The meeting ID is hidden below the Videoconference Rooms link, but here it is again:

10502145


* attendance: Illia, Cristina, Saba, JimK, Igor, Luca, Zbigniew, Alexey, Bo, Oli

* news
* Today it is all about looking at the full analysis pass and the documentation

* pass through code at Princeton
* https://slack-files.com/T0LRAT4HF-F21RA250Q-5460548b47
* need to complete the documentation ➜ the part on making plots at the end is not yet complete
* need to decide how to do fits at the end

* discussion: how easy is it to reproduce this at CERN?
* on-the-fly conversion vs. converting and storing Avro in HDFS at CERN
* 3 possibilities
* Intel needs to replicate the workflow in their facilities

* questions:
* Why Avro and not Parquet? ➜ First question to ask after CHEP paper
* In the code, there is a lot of usage of RDDs; why not use data frames immediately? ➜ Good question; it was easier to start with RDDs.

* NERSC
* use the HDF5 files directly and do not go through RDDs
* all ROOT files are at NERSC
* working with the HDF5Spark team to understand why the Scala API is not returning the correct data fields
* current bottleneck: reading data frames from HDF5 in Scala
* Python converter from ROOT to HDF5 works fine but is slow
* could do in C++
* but the conversion is only being done once for now; can optimize later
* don’t want to start the conversion of everything till the full analysis pass on one file is complete and successful
* from analysis perspective, all looks encouraging

* Grace Hopper conference
* presentation is on October 21
* Saba will prepare a short schedule of when each ingredient for the presentation needs to be ready
* especially the physics part needs help from all of us
* target is to have a full working workflow at NERSC by September 21

* goal:
* analysis code
* make documentation available on TWiki or similar public space (Slack does not seem to work, links do not work)
* complete documentation, especially the plotting part at the end
* everybody needs to start playing with the code!
* Princeton workflow
* replicate the workflow at any facility
* NERSC:
* complete HDF5 reading from Scala
* Discussion in 2 weeks: how do we compare Spark with ROOT? Metrics?


- 10:00 → 10:05
  
  News 5m
  
  Speakers: Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
  
  Documentation on how to run the analysis workflow
