CMS Big Data Science Project
PPD / Round Table - WH11SE - Wilson Hall, 11th floor South East
CERN room:
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If that is not possible, people can join the meeting by phone; the call-in numbers are here:
The meeting ID is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
## 170222 - Big Data Meeting
* attendance: Alexey, Bo, Illia, Ian, Luca (and other CERN people), Viktor, Matteo, Jim, OLI
* Jupyter notebooks
  * don't use the notebook's built-in option to open a second notebook file; instead, start a second pyspark notebook instance from the shell
* Scala
  * needs compilation and packaging into a jar; a build sketch follows below
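As a reference for that packaging step, here is a minimal sbt sketch; the project name, Scala version, and Spark version are placeholders rather than the actual analytix setup:

```scala
// build.sbt -- minimal sketch for packaging Spark/Scala code into a jar.
// Name and versions are placeholders; match them to what the cluster runs.
name := "cms-bigdata-analysis"

version := "0.1.0"

scalaVersion := "2.11.8"

// "provided": the cluster supplies Spark at runtime, so it stays out of the jar
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided"
```

Running `sbt package` then produces the jar under `target/scala-2.11/`, which can be shipped to the cluster via `spark-submit --jars` or attached to a notebook session.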
* Luca's suggestion: create a VM with access to the analytix cluster on which we can install all the software we need
  * rely on what is already on CVMFS from the SWAN project
  * Luca will work on providing the VM
* testing with hardware provided by Intel
  * provided by Intel specifically to perform benchmarks
  * dedicated capacity, available for stress tests
  * includes various software optimizations
  * otherwise no different from the analytix cluster
  * copied 1.2 TB of data from CMS Open Data
    * Viktor is testing it
* performance metrics / timing
  * the Spark history server could be a possibility
  * for now, Viktor used a Spark listener to get timestamps
    * it measures the time every executor takes
    * it needs to be built into a jar and included ➜ Viktor's listener implementation is very specific
    * maybe we can have a central jar available for everyone to use; this would require extending and generalizing Viktor's implementation ➜ Jim and Viktor?
  * Viktor will send the location of his code repository
* other performance metrics that we should measure (see the listener sketch below)
  * execution time, on both the executor and the driver side
  * CPU usage
  * parallelism of execution
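Viktor's code is not yet shared (repository location to follow), so purely as an illustration of the mechanism, here is a minimal sketch of what a generalized listener covering these metrics could look like; the class name and output format are invented, and `executorCpuTime` assumes Spark 2.1 or later:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

// Minimal timing listener sketch: logs per-task wall-clock and CPU time
// plus per-stage durations. Class name and output format are illustrative.
class TimingListener extends SparkListener {

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {  // metrics can be missing for failed tasks
      println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"runTime=${m.executorRunTime} ms cpuTime=${m.executorCpuTime / 1000000} ms")
    }
  }

  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val info = stage.stageInfo
    val duration = for {
      start <- info.submissionTime
      end   <- info.completionTime
    } yield end - start
    println(s"stage=${info.stageId} tasks=${info.numTasks} " +
      s"duration=${duration.getOrElse(-1L)} ms")
  }
}
```

Packaged into the proposed central jar, such a listener could be switched on per job with `--conf spark.extraListeners=TimingListener`, without touching the application code.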
* discussion of the plan (Google Doc)
  * SWAN is proposed as the way to use the analytix cluster because SWAN is already working on the integration of all components
  * thrust 1
    * "make plots"
  * thrust 2
    * "perform fits"
    * possible setup (sketched below):
      * start from ROOT files
      * reduce by 3 orders of magnitude into data frames
      * store as numpy arrays
      * read them back in with a numpy-to-ROOT converter (e.g. root_numpy)
      * then use the CMS fitting infrastructure
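A rough sketch of the cluster-side half of that pipeline, assuming direct ROOT reading with the spark-root data source; the data source name, file paths, and branch names are illustrative assumptions, and the numpy conversion and fitting hand-off would happen downstream:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Sketch of the reduction step: ROOT files -> small DataFrame.
// Paths and branch names are placeholders, not real CMS Open Data branches.
object ReduceToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("reduce-to-dataframe").getOrCreate()

    // Read ROOT files directly into a DataFrame, with no prior conversion step
    val events = spark.read
      .format("org.dianahep.sparkroot")          // spark-root data source
      .load("hdfs:///path/to/opendata/*.root")   // placeholder path

    // Example reduction: keep a few branches and apply a cut, shrinking
    // the data by orders of magnitude before it leaves the cluster
    val reduced = events
      .select("muon_pt", "muon_eta")             // hypothetical branch names
      .filter(col("muon_pt") > 20.0)

    // Persist the reduced DataFrame; conversion to numpy arrays and the
    // hand-off to the CMS fitting infrastructure happen downstream
    reduced.write.mode("overwrite").parquet("hdfs:///path/to/reduced")

    spark.stop()
  }
}
```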
* action items
  * Luca will create the VM
  * need a generic method/code base to measure performance (keyword: Spark listener) ➜ Jim and Viktor?
* plan:
  * have Scala code ready in 2 weeks to start performance investigations with Illia ➜ Matteo and OLI
* Question: conversion vs. direct ROOT reading
  * the opinion is not to pursue conversion
    * how did we come to this decision? ➜ convenience
    * conversion wouldn't see widespread adoption
  * are we sure that by not converting we get the performance we need?
    * argument: doing the conversion in Spark itself is a lot better (it uses the built-in parallelization)
  * come back to this topic next time
* Next meeting in 2 weeks, March 8th