CMS Big Data Science Project
PPD / Round Table - WH11SE - Wilson Hall, 11th floor South East
CERN room:
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If that is not possible, people can join the meeting by phone; the call-in numbers are here:
The meeting ID is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
## 170222 - Big Data Meeting
* attendance: Alexey, Bo, Illia, Ian, Luca (and other CERN people), Viktor, Matteo, Jim, OLI
* Jupyter notebooks
  * don't use the notebook's built-in option to open a second notebook file; instead, start a second pyspark notebook instance from the shell
* Scala
  * needs compilation and packaging into a jar; a build sketch follows below
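As a reference for that packaging step, here is a minimal sbt sketch; the project name, Scala version, and Spark version are placeholders rather than the actual analytix setup:

```scala
// build.sbt -- minimal sketch for packaging Spark/Scala code into a jar.
// Name and versions are placeholders; match them to what the cluster runs.
name := "cms-bigdata-analysis"

version := "0.1.0"

scalaVersion := "2.11.8"

// "provided": the cluster supplies Spark at runtime, so it stays out of the jar
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided"
```

Running `sbt package` then produces the jar under `target/scala-2.11/`, which can be shipped to the cluster via `spark-submit --jars` or attached to a notebook session.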
* Luca's suggestion: create a VM with access to the analytix cluster on which we can install all the software we need
  * rely on what is already on CVMFS from the SWAN project
  * Luca will work on providing the VM
* testing with hardware provided by Intel
  * provided by Intel specifically to perform benchmarks
  * dedicated capacity, available for stress tests
  * includes various software optimizations
  * otherwise no different from the analytix cluster
  * copied 1.2 TB of data from CMS Open Data
    * Viktor is testing it
* performance metrics / timing
  * the Spark history server could be a possibility
  * for now, Viktor used a Spark listener to get timestamps
    * it measures the time every executor takes
    * it needs to be built into a jar and included ➜ Viktor's listener implementation is very specific
    * maybe we can have a central jar available for everyone to use; this would require extending and generalizing Viktor's implementation ➜ Jim and Viktor?
  * Viktor will send the location of his code repository
* other performance metrics that we should measure (see the listener sketch below)
  * execution time, on both the executor and the driver side
  * CPU usage
  * parallelism of execution
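Viktor's code is not yet shared (repository location to follow), so purely as an illustration of the mechanism, here is a minimal sketch of what a generalized listener covering these metrics could look like; the class name and output format are invented, and `executorCpuTime` assumes Spark 2.1 or later:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

// Minimal timing listener sketch: logs per-task wall-clock and CPU time
// plus per-stage durations. Class name and output format are illustrative.
class TimingListener extends SparkListener {

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {  // metrics can be missing for failed tasks
      println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"runTime=${m.executorRunTime} ms cpuTime=${m.executorCpuTime / 1000000} ms")
    }
  }

  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val info = stage.stageInfo
    val duration = for {
      start <- info.submissionTime
      end   <- info.completionTime
    } yield end - start
    println(s"stage=${info.stageId} tasks=${info.numTasks} " +
      s"duration=${duration.getOrElse(-1L)} ms")
  }
}
```

Packaged into the proposed central jar, such a listener could be switched on per job with `--conf spark.extraListeners=TimingListener`, without touching the application code.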
* discussion of the plan (Google Doc)
  * SWAN is proposed as the way to use the analytix cluster because SWAN is already working on the integration of all components
  * thrust 1
    * "make plots"
  * thrust 2
    * "perform fits"
    * possible setup (sketched below):
      * start from ROOT files
      * reduce by 3 orders of magnitude into data frames
      * store as numpy arrays
      * read them back in with a numpy-to-ROOT converter (e.g. root_numpy)
      * then use the CMS fitting infrastructure
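A rough sketch of the cluster-side half of that pipeline, assuming direct ROOT reading with the spark-root data source; the data source name, file paths, and branch names are illustrative assumptions, and the numpy conversion and fitting hand-off would happen downstream:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Sketch of the reduction step: ROOT files -> small DataFrame.
// Paths and branch names are placeholders, not real CMS Open Data branches.
object ReduceToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("reduce-to-dataframe").getOrCreate()

    // Read ROOT files directly into a DataFrame, with no prior conversion step
    val events = spark.read
      .format("org.dianahep.sparkroot")          // spark-root data source
      .load("hdfs:///path/to/opendata/*.root")   // placeholder path

    // Example reduction: keep a few branches and apply a cut, shrinking
    // the data by orders of magnitude before it leaves the cluster
    val reduced = events
      .select("muon_pt", "muon_eta")             // hypothetical branch names
      .filter(col("muon_pt") > 20.0)

    // Persist the reduced DataFrame; conversion to numpy arrays and the
    // hand-off to the CMS fitting infrastructure happen downstream
    reduced.write.mode("overwrite").parquet("hdfs:///path/to/reduced")

    spark.stop()
  }
}
```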
* action items
  * Luca will create the VM
  * need a generic method/code base to measure performance (keyword: Spark listener) ➜ Jim and Viktor?
* plan:
  * have Scala code ready in 2 weeks to start performance investigations with Illia ➜ Matteo and OLI
* Question: conversion vs. direct ROOT reading
  * the opinion is not to pursue conversion
    * how did we come to this decision? ➜ convenience
    * conversion wouldn't see widespread adoption
  * are we sure that by not converting we get the performance we need?
    * argument: doing the conversion in Spark itself is a lot better (it uses the built-in parallelization)
  * come back to this topic next time
* Next meeting in 2 weeks, March 8th