CMS Big Data Science Project

US/Central
Description

Fermilab room: OLI's office on WH11E

CERN room:

 

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If that is not possible, people can join the meeting by phone; call-in numbers are here:

The meeting ID is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

 

  • attendance: Saba, Alexey, Bo, JimP, Cristina, Matteo

 

 

  • O&C week, Puerto Rico
    • 2 talks by Jim
    • 1 talk by OLI

 

  • Action items from last week
    • Cristina: calculate the amount of data we need to load for the first plot (signal region) DONE
      • various data and MC samples
      • data: 40 GB
      • MC: 80 GB
      • ➜ plan for 150 GB
      • would fit into RAM on the Princeton cluster, including decompression
      • samples are in EOS at CERN ➜ plan is to transfer them to Princeton and set up an xrootd server at Princeton
    • Alexey DONE
      • can we use the Princeton test cluster for all our tests and development? ➜ in principle yes
      • first need to figure out access for JimP
      • then we need 5 additional accounts
      • need to request big data cluster account
      • access is via VPN for Windows and OS X; for Linux, have to ask JimP how to get access
      • Princeton working on easier access
    • OLI: DONE
      • create channels in Slack: training channel, cluster channel (everything related to access and usage of facilities)
    • OLI: DONE
      • start CHEP abstract
    • Cristina and JimP DONE
      • set up GitHub project for analysis code and other utilities
      • most of JimP’s work that is general will live in the DIANA GitHub project
    • Cristina, JimP and Matteo: still to be done
      • decouple dictionaries and BACON libraries from CMSSW

 

  • first milestone:
    • load BACON ntuples in Hadoop+Spark and produce a plot by the beginning of April
    • Structural
      • started with map, filter, reduce
      • flatMap: a map over functions that return lists, with the results flattened
        • write a function that returns a list
        • map: you get a list of lists
        • flatMap: you get a single flat list
        • flatMap avoids having to use filter and map and calculate intermediate variables
    • currently discussing how best to return the output
      • flatMap will filter and also reduce the event content
      • currently 80 variables
      • we could use Parquet (popular for saving SQL tables)
      • output after transformation is a flat table (currently it is not foreseen to have sub-structure)
      • plotting should be a SQL-like operation on the output of filtering/skimming
  • general structure of project
    • cluster is used for skimming/slimming and converting ROOT files (can then also be persisted on the cluster)
    • output is stored on the cluster and then used for plotting (either on the cluster or interactively)
  • plotting
    • matplotlib for now
    • for the future: distributed histogram filling
      • matches with reduce, should be easy to put histograms into reduce code
    • Alexey: data frames don’t have histograms, RDDs have distributed histograms
    • need to develop a single interactive histogram mode and a mode for many histograms (then we need to fill many histograms in a single or a few reduce steps)
  • new milestones
    • 1st: define completely distributed skimming that reads ROOT files and writes out to distributed Parquet file(s) (https://parquet.apache.org/)
      • JimP and Cristina, Matteo, Alexey
      • move to Scala
      • two weeks, planning to achieve by next meeting, April 13
    • 2nd: get the data from CERN to the Princeton Big Data cluster and set up an xrootd server (JimP, Cristina, Matteo, Alexey)
      • two weeks, planning to achieve by next meeting, April 13
    • 3rd: plotting
      • four weeks
    • 4th: use official CMS data formats as input and not BACON ntuples
    • 5th: multi-user
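
The map/flatMap distinction and the "histograms in reduce" idea from the notes above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's code: the event fields (met, njets, run), the selection cut, and the bin edges are all hypothetical, and standard-library functions stand in for the corresponding Spark operations.

```python
from functools import reduce
from itertools import chain

# Hypothetical toy events; the field names are illustrative, not BACON branches.
events = [
    {"met": 250.0, "njets": 2, "run": 1},
    {"met": 80.0, "njets": 4, "run": 1},
    {"met": 310.0, "njets": 1, "run": 2},
    {"met": 120.0, "njets": 3, "run": 2},
]

def skim(event):
    """Return a list: one slimmed row if the event passes, else an empty list."""
    if event["met"] > 100.0:  # hypothetical selection cut
        return [{"met": event["met"], "njets": event["njets"]}]
    return []

# map keeps one output per input, so the result is a list of lists.
mapped = list(map(skim, events))

# flatMap concatenates the returned lists, which filters and slims in one pass.
flat_table = list(chain.from_iterable(map(skim, events)))

EDGES = [0.0, 100.0, 200.0, 300.0, 400.0]  # hypothetical bin edges for met

def fill(hist, row):
    """Fold step: return a new histogram with one row added."""
    for i in range(len(EDGES) - 1):
        if EDGES[i] <= row["met"] < EDGES[i + 1]:
            return hist[:i] + [hist[i] + 1] + hist[i + 1:]
    return hist  # out of range: counts unchanged

def merge(a, b):
    """Combine step: add two partial histograms bin by bin."""
    return [x + y for x, y in zip(a, b)]

empty = [0] * (len(EDGES) - 1)
# Pretend the flat table is split over two partitions, as it would be on a cluster.
partitions = [flat_table[:2], flat_table[2:]]
partials = [reduce(fill, part, empty) for part in partitions]
met_hist = reduce(merge, partials, empty)
```

Here `mapped` still contains an empty list for the rejected event, while `flat_table` holds only the three slimmed rows. Because `merge` is associative, partial histograms from different partitions can be combined in any order, which is why histogram filling matches the reduce step; filling many histograms at once would amount to carrying a collection of such bin lists through the same fold.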

 

 

  • action items
    • Saba: Improve first part of CHEP abstract
      • Jim’s input on story: industry has caught up in size, but they are solving somewhat different problems (the reduce step of map/reduce was never handled by physicists)
    • OLI: submit CHEP abstract to CMS and handle author list
    • Cristina, Matteo, JimP, and Alexey: define completely distributed skimming that reads ROOT files and writes out to distributed Parquet file(s) (MILESTONE1)
    • Cristina, Matteo, JimP, and Alexey: transfer the samples to Princeton and also set up an xrootd server at Princeton (MILESTONE2)
    • Cristina, JimP and Matteo:
      • decouple dictionaries and BACON libraries from CMSSW
    • Alexey, JimP, Cristina, Matteo
      • switch to a common big data GitHub organization (no personal GitHub accounts) for everything that is not DIANA ➜ Alexey
    • Everyone:
      • write documentation at milestones so that stupid OLI can run it

 

  • next meeting:
    • April 13th