CMS Big Data Science Project

Name: CMS Big Data Science Project
Start: 2016-03-30T10:00:00-05:00
End: 2016-03-30T11:00:00-05:00
Location: No location set

Wednesday 30 Mar 2016, 10:00 → 11:00 US/Central

Description

Fermilab room: OLI's office on WH11E

CERN room:

Instructions to create a lightweight CERN account to join the meeting via Vidyo:

https://account.cern.ch/account/Externals/

If that is not possible, you can join the meeting by phone; the call-in numbers are here:

http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone

The meeting ID is hidden below the Videoconference Rooms link, but here it is again:

10502145


this meeting’s agenda: https://indico.cern.ch/event/514733/

attendance: Saba, Alexey, Bo, JimP, Cristina, Matteo

CHEP abstract
- https://docs.google.com/document/d/1nkp6MGJXyVwYgl_tiL_clscynaQT6XhaG_jMVJK0n2M/edit?usp=sharing
- author list: CMS, Alexey and JimP from Princeton, Saba and JimK from Fermilab

O&C week Puerto Rico
- 2 talks by Jim
- 1 talk by OLI

Action items from last week
- Cristina is going to calculate the amount of data we need to load for the first plot (signal region): DONE
  - various data and MC samples
  - data: 40 GB
  - MC: 80 GB
  - ➜ plan for 150 GB
  - would fit into RAM on the Princeton cluster, including decompression
  - samples are in EOS at CERN ➜ plan is to transfer them to Princeton and set up an xrootd server at Princeton
- Alexey: DONE
  - can we use the Princeton test cluster for all our tests and development? ➜ in principle yes
  - need to figure out access for JimP first
  - then we need 5 additional accounts
  - need to request a big data cluster account
  - access: use VPN on Windows and OS X; for Linux, ask JimP how to get access
  - Princeton is working on easier access
- OLI: DONE
  - create channels in Slack: training channel, cluster channel (everything related to access and usage of facilities)
- OLI: DONE
  - start CHEP abstract
- Cristina and JimP: DONE
  - set up GitHub project for analysis code and other utilities
  - most of JimP’s general-purpose work will live in the DIANA GitHub project
- Cristina, JimP and Matteo: still to be done
  - decouple dictionaries and BACON libraries from CMSSW

first milestone:
- load BACON ntuples in Hadoop+Spark and produce a plot by the beginning of April
  - DONE!
  - plot: https://fermicloud-my.sharepoint.com/personal/gutsche_services_fnal_gov/_layouts/15/guestaccess.aspx?guestaccesstoken=zyGm1XlQ5xRCKzWPoGiMGi%2fDjFVYh6xBw%2f9WT4khRy4%3d&docid=02e2b9bb805bb441fb1ff474d711041ff
  - used ROOT file converted to JSON
  - used pyspark on laptop to develop filtering and skimming code
  - Jim created a class to treat each of the fields in the JSON file as an object ➜ very convenient
- Structural
  - started with map, filter, reduce
  - flatMap: map over a list of lists
    - write a function that returns a list
    - map: you get a list of lists
    - flatMap: you get a single flat list
    - flatMap avoids having to use filter and map separately and to calculate intermediate variables
- currently discussing how best to return the output
  - flatMap will filter and also reduce the event content
  - currently 80 variables
  - we could use Parquet (popular for storing SQL tables)
  - output after transformation is a flat table (currently no sub-structure is foreseen)
  - plotting should be an SQL-like operation on the output of filtering/skimming
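
The difference between map and flatMap described above can be sketched in plain Python, without Spark. This is only an illustration under assumptions: the toy events, the field names (`met`, `jets`, `jet_pt`), the cut values and the function `select_jets` are hypothetical and are not taken from the BACON ntuples or from the actual skimming code. `itertools.chain.from_iterable` stands in for what flatMap does on an RDD.

```python
from itertools import chain

# Hypothetical toy events; field names and values are illustrative only.
events = [
    {"met": 250.0, "jets": [120.0, 45.0, 20.0]},
    {"met": 80.0, "jets": [60.0]},
    {"met": 310.0, "jets": [35.0, 15.0]},
]

def select_jets(event, met_cut=200.0, jet_pt_cut=30.0):
    """Return a list of flat rows; the list is empty if the event fails the cut."""
    if event["met"] < met_cut:
        return []  # dropping the event plays the role of a filter
    # Keep only the variables we want: this reduces the event content.
    return [{"met": event["met"], "jet_pt": pt}
            for pt in event["jets"] if pt > jet_pt_cut]

# map: one output per input event, so the result is a list of lists.
mapped = list(map(select_jets, events))

# flatMap: same function, but the inner lists are concatenated into one flat list.
flat_mapped = list(chain.from_iterable(map(select_jets, events)))
```

Because the function may return an empty list, the filter step and the reduction of the event content happen in one pass, and the result is already a flat table of rows.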

general structure of project
- cluster is used for skimming/slimming and converting ROOT files (the result can then also be persisted on the cluster)
- output is stored on the cluster and then used for plotting (either on the cluster or interactively)

plotting
- matplotlib for now
- for the future: distributed histogram filling
  - matches with reduce, should be easy to put histograms into reduce code
- Alexey: data frames don’t have histograms, RDDs have distributed histograms
- need to develop single interactive histogram mode and mode for many histograms (then we need to fill many histograms in a single or few reduce steps)
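
A minimal sketch of why histogram filling matches reduce, and how several histograms can be filled in a single reduce step. It is plain Python, not Spark, and everything in it is an assumption made for illustration: the booking table `BOOKED`, the field names, the uniform binning and the toy partitions. Each partition fills its own partial histograms, and the partial results are merged by adding bin counts.

```python
from functools import reduce

# Hypothetical booking: histogram name -> (row field, number of bins, low edge, high edge)
BOOKED = {
    "met": ("met", 4, 0.0, 400.0),
    "jet_pt": ("jet_pt", 3, 0.0, 150.0),
}

def fill_partition(rows):
    """Fill every booked histogram from the rows of one partition."""
    hists = {name: [0] * nbins for name, (_, nbins, _, _) in BOOKED.items()}
    for row in rows:
        for name, (field, nbins, low, high) in BOOKED.items():
            value = row[field]
            if low <= value < high:  # values outside the range are ignored
                hists[name][int((value - low) * nbins / (high - low))] += 1
    return hists

def merge(a, b):
    """Combine two sets of partial histograms by adding bin counts."""
    return {name: [x + y for x, y in zip(a[name], b[name])] for name in a}

# Toy partitions of flat rows, standing in for the distributed skim output.
partitions = [
    [{"met": 250.0, "jet_pt": 120.0}, {"met": 250.0, "jet_pt": 45.0}],
    [{"met": 310.0, "jet_pt": 35.0}],
]

# One map over the partitions plus one reduce fills all histograms at once.
total = reduce(merge, map(fill_partition, partitions))
```

Since adding bin counts is associative and commutative, the merge can run in any order across partitions, which is what makes it suitable as a reduce function.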

new milestones
- 1st: define completely distributed skimming that reads ROOT files and writes out to distributed Parquet file(s) (https://parquet.apache.org/)
  - JimP and Cristina, Matteo, Alexey
  - move to Scala
  - two weeks, planning to achieve by next meeting, April 13
- 2nd: get the data from CERN to the Princeton Big Data cluster and set up an xrootd server (JimP, Cristina, Matteo, Alexey)
  - two weeks, planning to achieve by next meeting, April 13
- 3rd: plotting
  - four weeks
- 4th: use official CMS data formats as input and not BACON ntuples
- 5th: multi-user

went through JimP’s root-converter example
- https://github.com/diana-hep/rootconverter/tree/master/spark-examples/commandline
- went through the example including persist
- discussion with Saba: can you use this in SparkR?
  - problem was the deeply nested structure; Saba has an ASCII dumper
  - JimP suggests using the root-converter in the SparkR case as well; it should work and should be better

action items
- Saba: improve first part of CHEP abstract
  - Jim’s input on story: industry has caught up in size, but it is solving somewhat different problems (the reduce step of map/reduce was never handled by physicists)
- OLI: submit CHEP abstract to CMS and handle author list
- Cristina, Matteo, JimP, and Alexey: define completely distributed skimming that reads ROOT files and writes out to distributed Parquet file(s) (MILESTONE1)
- Cristina, Matteo, JimP, and Alexey: transfer the samples to Princeton and also set up an xrootd server at Princeton (MILESTONE2)
- Cristina, JimP and Matteo:
  - decouple dictionaries and BACON libraries from CMSSW
- Alexey, JimP, Cristina, Matteo:
  - switch to common big data GitHub organization (no personal GitHub accounts) for everything that is not DIANA ➜ Alexey
- Everyone:
  - write documentation at milestones so that stupid OLI can run it

next meeting:
- April 13th
