CMS Big Data Science Project
FNAL room: DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East
CERN room:
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If that is not possible, people can join the meeting by phone; call-in numbers are here:
The meeting id is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
# Meeting 160706
attendance: Zbigniew, Luca, Alexey, Illa, Cristina, Bo, OLI, Matteo
agenda: <https://indico.cern.ch/event/549221/>
## News
* CHEP abstract was accepted as an oral presentation (12+3)
* August 1st to August 5th: Alexey will come to FNAL.
## Notes
* quick introduction for Zbigniew and Luca
* discussion about AVRO, JSON and parquet
* Action item for OLI: send introduction information to Zbigniew and Luca
* Luca and Zbigniew:
* running CERN-IT hadoop service
* also in openlab
* a fellow is coming in the fall to help with this and other use cases
* Princeton Workflow:
* last week met with JimP to finalize Princeton workflow
1. filtering step is 95% complete, just need to check
* scale factors are not yet done
* plan to convert all scale factors into JSON and use this as a common input format
* b-tag scale factors are in a CSV file, easy to convert into JSON
2. save all information in parquet file, created class and functions
* need to complete this step
3. histogramming: we’re going to use histogrammar (which is in very good shape)
* did some preliminary plots
* new histogrammar version has stack plots
* intensify interaction with Alexey about histogramming
* list of things to complete
1. complete the list of things that we are saving in the parquet format
2. complete scale factor treatment
3. implement plots using histogrammar
* goal:
* didn’t reach the goal of having a full-scale test for this meeting
* looks good for the meeting in 2 weeks
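
The planned CSV-to-JSON conversion of scale factors could look roughly like the sketch below. It is only an illustration: the column names and values in `sample` are hypothetical and do not reflect the real b-tag scale factor file, and the helper names `parse_value` and `csv_to_json` are made up for this example.

```python
import csv
import io
import json


def parse_value(text):
    """Return a float when the field is numeric, otherwise the original string."""
    try:
        return float(text)
    except ValueError:
        return text


def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of row objects."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [{key: parse_value(value) for key, value in row.items()} for row in reader]
    return json.dumps(rows, indent=2)


# Hypothetical sample input, NOT the real b-tag scale factor layout.
sample = """operating_point,flavor,pt_min,pt_max,scale_factor
medium,b,30,50,0.95
medium,b,50,70,0.97
"""

scale_factors_json = csv_to_json(sample)
print(scale_factors_json)
```

Keeping one JSON object per CSV row means the same reader code can later be pointed at other scale factor sources once they are converted to the common format.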
* histogrammar
* from the beginning, it was designed to aggregate data into bins using Spark actions
* wrote tutorials, Cristina is following them
* asking about default error bars, talking with JimP, Alexey doing some more development
* started working on profiling and optimization of histogrammar
* JimP used numpy
* Alexey used Intel Python ➜ 10% improvement on top of optimized version
* let’s post information in Slack channels so that everyone can read it
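
The idea of aggregating data into bins so that it maps onto a Spark action can be sketched in plain Python as follows. This is not histogrammar's API; it is a minimal illustration that assumes fixed-width bins, with hypothetical names (`empty`, `increment`, `combine`) and made-up data. The three functions have the same roles as the zero value, per-element operation, and merge operation that Spark's `RDD.aggregate` expects, and the nested lists stand in for data partitions.

```python
from functools import reduce

# Assumed binning for this sketch: 5 equal-width bins on [0, 100).
NUM_BINS, LOW, HIGH = 5, 0.0, 100.0


def empty():
    """Zero value: a histogram with no entries."""
    return [0] * NUM_BINS


def increment(hist, value):
    """Add one value to a partial histogram; out-of-range values are dropped."""
    if LOW <= value < HIGH:
        index = int((value - LOW) / (HIGH - LOW) * NUM_BINS)
        hist[index] += 1
    return hist


def combine(left, right):
    """Merge two partial histograms bin by bin."""
    return [a + b for a, b in zip(left, right)]


# Made-up data, split into lists that stand in for Spark partitions.
partitions = [[5.0, 25.0, 45.0], [15.0, 95.0, 120.0], [55.0, 50.0]]

# Fill one partial histogram per partition, then merge the partial results.
partials = [reduce(increment, part, empty()) for part in partitions]
total = reduce(combine, partials, empty())
print(total)
```

Because `combine` is associative and commutative, the partial histograms can be merged in any order, which is what lets the filling step run independently on each partition.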