CMS Big Data Science Project

US/Central
Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
Description

FNAL room: TaberNAcle - WH5E - Wilson Hall, 5th floor East

CERN room:

 

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If not possible, people can join the meeting by the phone, call-in numbers are here:

The meeting id is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

## Meeting

attendance: Illia, Zbigniev, Luca, Maria, Alexey, Bo, JimK, Nhan, Matteo, JimP, OLI

* Princeton thrust
    * code is not yet visible from outside
    * plan to repeat it at CERN
        * afternoon work to make it work at CERN
        * read root inputs on the fly
    * by the end of the week, documentation should be figured out
* NERSC thrust
    * question about documentation, coherent flow was not visible
    * need quick response time on questions to implement Princeton workflow at NERSC
    * not sure how much modification of the analysis code will be needed
        * conversion from a row-wise to a column-wise workflow
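The row-wise to column-wise conversion mentioned above can be sketched in plain Python; the event fields and function name here are illustrative stand-ins, not the project's actual analysis code:

```python
# Convert a row-wise event list (one dict per event, as read from a ROOT
# tree) into a column-wise layout (one list per branch), which is the
# shape Spark/Parquet-style processing prefers. Branch names are made up.
def rows_to_columns(events):
    columns = {}
    for event in events:
        for branch, value in event.items():
            columns.setdefault(branch, []).append(value)
    return columns

events = [
    {"run": 1, "nMuons": 2, "met": 41.5},
    {"run": 1, "nMuons": 0, "met": 12.3},
]
cols = rows_to_columns(events)
# cols["nMuons"] == [2, 0]; cols["met"] == [41.5, 12.3]
```

In the real workflow this reshaping would happen once at conversion time, so the analysis code downstream only ever sees columns.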
* NoSQL
    * JimP gave Igor some Bacon AVRO files to test and work with
    * Igor is talking to JimP
* Plans for CHEP
    * paper done by beginning of October
    * have 1 month
* Grace Hopper
    * 3 weeks after CHEP
    * needs to have a paper with pretty good results and performance measurements
    * by the end of September need to have the coding done
* SC is November 20th

* Metrics
    * working at Princeton
    * technical comparison
        * dummy root analysis (counting events) compared to spark (counting events)
        * full analysis code (use TTreeCache vs. Spark cached)
    * usability question
        * how much easier is it to do the analysis in spark vs. root
    * physics comparison
        * take established workflow
        * change basic code, like a cut
        * measure the time to change the cut
        * measure the time to produce a new physics plot with backgrounds and signal and everything
    * comparison: running both at 30 different operation points
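The technical comparison's timing could be collected with a small harness like the following sketch; the two counting functions are hypothetical stand-ins for the real ROOT and Spark paths:

```python
import time

def time_call(fn, *args):
    """Return (result, wall-clock seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Stand-ins for the two dummy analyses (counting events); the real
# comparison would count via a ROOT TTree on one side and a Spark
# DataFrame/RDD on the other.
def count_events_rootlike(events):
    return sum(1 for _ in events)

def count_events_sparklike(events):
    return len(list(events))

events = range(100000)
n_root, t_root = time_call(count_events_rootlike, events)
n_spark, t_spark = time_call(count_events_sparklike, events)
assert n_root == n_spark  # both paths must agree before timings mean anything
```

The same harness would then be re-run at each of the 30 operation points to build the comparison curve.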


* Metrics
    * technical comparison of the different steps of the workflow: time and memory
        * JimK: problems with lazy evaluation
        * JimP: breaking it down into trivial tasks
            * dummy root analysis (counting events) compared to spark (counting events)
            * full analysis code (use TTreeCache vs. Spark cached)
    * usability comparison, user experience
        * changing one cut and producing the same physics plot (background and data)
        * changing a more significant piece of the analysis and comparing to produce a physics plot
        * optimization problem of either a cut or a plot (order of backgrounds) 
        * need objective and subjective metrics, also include how much time it takes to change the code
    * JimP: familiar with root, batch systems, spark, scala ➜ but no idea of the workflow
    * Nhan: different, familiar with analysis, root, batch systems; not familiar with Spark
    * Saba: runs Spark at NERSC
    * CERN: measure how hard it is to move the workflow to CERN
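The "change a cut, reproduce the plot" metric could be instrumented along these lines; the cut variable, threshold values, and function name are made up for illustration:

```python
# Parameterize the cut so changing it is a one-line edit; the usability
# metric is then the wall-clock (and human) time from that edit to a
# refreshed physics plot. "met" is a placeholder branch name.
def select(events, met_cut):
    return [e for e in events if e["met"] > met_cut]

events = [{"met": m} for m in (10.0, 25.0, 40.0, 55.0)]
baseline = select(events, met_cut=30.0)   # 2 events pass
tightened = select(events, met_cut=50.0)  # 1 event passes
```

Both the ROOT and Spark versions of the workflow would expose the same knob, so the objective timing and the subjective "how hard was the edit" measurements stay comparable.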

* Getting the data somewhere to analyze it is the hardest thing
    * CERN is thinking about how to improve and simplify data access from, for example, Spark/Hadoop
        * can mount EOS
        * can even use CERNBox
        * working on making the hdfs command understand xrootd ➜ can this be open sourced? JimP is very interested, a fundamental building block

* Next meeting, next week

    • 10:00–10:05
      News 5m
      Speakers: Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))

      Plan for today:

       

      1. quick round table about coordination issues between different thrusts

      2. NoSQL: Talk about plans next meeting? Anything needed for next steps?

      3. Metrics discussion!