CMS Big Data Science Project

US/Central
Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
Description

FNAL room: DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East

CERN room:

 

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If that is not possible, you can join the meeting by phone; the call-in numbers are here:

The meeting ID is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

# Meeting 160706

attendance: Zbigniew, Luca, Alexey, Illa, Cristina, Bo, OLI, Matteo

agenda: <https://indico.cern.ch/event/549221/>

## News

* CHEP abstract was accepted as oral presentation (12+3)
* August 1st to August 5th: Alexey will come to FNAL.

## Notes

* quick introduction for Zbigniew and Luca
    * discussion about Avro, JSON and Parquet
    * Action item for OLI: send introduction information to Zbigniew and Luca
* Luca and Zbigniew:
    * running the CERN-IT Hadoop service
    * also in openlab
    * a fellow is coming in the fall to help with this and other use cases
* Princeton Workflow:
    * last week met with JimP to finalize Princeton workflow
        1. filtering step is 95% complete, just need to check
            * scale factor treatment is not yet done
            * plan to convert all scale factors into JSON and use this as a common input format
            * b-tag scale factors are in a CSV file, easy to convert into JSON
        2. save all information in a Parquet file; class and functions have been created
            * need to complete this step
        3. histogramming: we’re going to use histogrammar (it is in very good shape)
            * did some preliminary plots
            * new histogrammar version has stacked plots
            * intensify interaction with Alexey about histogramming
    * list of things to complete
        1. complete the list of things that we are saving in the Parquet format
        2. complete scale factor treatment
        3. implement plots using histogrammar
    * goal:
        * didn’t reach the goal of having a full-scale test for this meeting
        * looks achievable in 2 weeks
* histogrammar
    * from the beginning, it was designed to aggregate data into bins using Spark actions
    * wrote tutorials, Cristina is following them
    * asking about default error bars, talking with JimP; Alexey is doing some more development
    * started working on profiling and optimization of histogrammar
        * JimP used NumPy
        * Alexey used Intel Python ➜ 10% improvement on top of the optimized version
        * let's post information in Slack channels so that everyone can read it
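
As a rough illustration of the scale factor item under the Princeton workflow (b-tag scale factors arrive as CSV and are to be converted into JSON as a common input format), here is a minimal Python sketch. The column names (`eta_min`, `eta_max`, `pt_min`, `pt_max`, `scale_factor`), the top-level key `scale_factors`, and the sample rows are hypothetical placeholders, not the actual b-tag file layout or values.

```python
import csv
import io
import json

# Hypothetical CSV content; the real b-tag scale factor file has its own columns.
SAMPLE_CSV = """eta_min,eta_max,pt_min,pt_max,scale_factor
0.0,1.2,20,50,0.98
0.0,1.2,50,100,0.97
1.2,2.4,20,50,0.95
"""


def csv_to_records(text):
    """Parse CSV text into a list of dicts with numeric values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: float(value) for key, value in row.items()} for row in reader]


def records_to_json(records):
    """Serialize the records as a JSON string under one top-level key."""
    return json.dumps({"scale_factors": records}, indent=2)


records = csv_to_records(SAMPLE_CSV)
json_text = records_to_json(records)
print(json_text)
```

Keeping every scale factor source in one JSON layout like this would let the filtering step read them all through a single code path, whatever the original format was.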
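
The note that histogrammar aggregates data into bins using Spark actions can be illustrated with a plain-Python sketch of the underlying pattern: an empty container, a function that fills one value into a partial result, and a function that merges two partial results. This sketch does not use histogrammar's API. The names `empty_hist`, `fill`, `combine`, the binning, and the sample values are assumptions for illustration, and the partitions are simulated with plain lists.

```python
from functools import reduce

NUM_BINS, LOW, HIGH = 5, 0.0, 100.0  # illustrative binning, not from the meeting


def empty_hist():
    """Zero value: one counter per bin."""
    return [0] * NUM_BINS


def fill(hist, value):
    """Add one value to a partial histogram; out-of-range values are dropped."""
    if LOW <= value < HIGH:
        index = int((value - LOW) / (HIGH - LOW) * NUM_BINS)
        hist[index] += 1
    return hist


def combine(left, right):
    """Merge two partial histograms bin by bin."""
    return [a + b for a, b in zip(left, right)]


# Simulated partitions standing in for a distributed dataset.
partitions = [[5.0, 25.0, 45.0], [47.0, 99.9, 150.0], [0.0, 61.0]]

# Fill one partial histogram per partition, then merge the partials.
partials = [reduce(fill, part, empty_hist()) for part in partitions]
total = reduce(combine, partials, empty_hist())
print(total)
```

With a Spark RDD, the same three pieces map onto the arguments of `RDD.aggregate(zeroValue, seqOp, combOp)`, for example `rdd.aggregate(empty_hist(), fill, combine)`.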
