CMS Big Data Science Project
FNAL: WH 11 NW ROC
CERN: 600-R-001
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If that is not possible, you can join the meeting by phone; call-in numbers are here:
The meeting id is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
* attendance: Illia, Alexey, JimP, Saba, JimK, Bo, Matteo, Viktor, Kacper,
Vagg, OLI
* News
* Dates and Events
* CMS Offline & Computing week just ended; we gave a status report on
Big Data
* 170404 - Status of CMS Big Data Project.pdf
* CERN Openlab workshop on Machine Learning and Data Analytics, April
27, CERN
* https://indico.cern.ch/event/627852/
* We will have a talk
* DS@HEP at FNAL, May 8-12, FNAL
* https://indico.fnal.gov/conferenceDisplay.py?confId=13497
* Matteo will give a talk
* HEP Analysis Ecosystem Workshop, May 22-24, Amsterdam
* https://indico.cern.ch/event/613842/timetable/
* “Database Futures” workshop at CERN on May 29th-30th
* https://indico.cern.ch/event/615499/
* to discuss possible future needs in the database area for Run3+4.
Today we see mostly relational and non-relational database
models.
* New trends are Cloud Computing, Big Data, proactive & predictive
performance analysis, …
* We should write an abstract!
* Need to write abstracts for
* ACAT
* JimP and Igor: database backend to NoSQL project (one abstract)
and query language part (2nd abstract)
* "Database Futures" workshop
* anything else?
* Saba:
* reviewing DS abstract for Grace-Hopper conference
* next time further updates
* Matteo
* a meeting a couple of days ago clarified things
* Viktor asked for Panda ntuples; they will be moved to EOS at CERN, then
into HDFS
* we could copy it directly into HDFS at CERN
* Matteo will describe the C++ workflow to produce Panda ntuples
* Request from Luca
* publicly accessible task list for everyone -> Vagg is working on this
* Vagg
* EOS direct reading is in progress
* status of reading open data and selection/reduction
* not done this week, will be done soon
* Thrust 1: Panda
* Schema was different for CHEP paper, completely different code base
* From the Spark standpoint, objects need to be put back together because it
is a flat tree
* concluded that no code is needed from Viktor, all functionality is there
* need a small update of spark-root
* JimP wants to change the plotting commands from RDDs to SQL, which is
more efficient because it is a flat tree
* Histogrammer web has a lot of tricks
* goal is to provide Matteo with one example of how to use Histogrammer
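A minimal sketch of what "putting objects back together" from a flat tree can
mean, written in plain Python rather than Spark so that it runs standalone. The
branch names (`jet_pt`, `jet_eta`, `met`) and the helper `regroup` are
hypothetical and do not reflect the actual Panda schema or the spark-root API.

```python
from typing import Dict, List


def regroup(event: Dict[str, List[float]], prefix: str) -> List[Dict[str, float]]:
    """Zip parallel per-event arrays sharing a prefix into a list of objects."""
    # Collect branches such as "jet_pt" and expose them as field "pt".
    fields = {
        name[len(prefix) + 1:]: values
        for name, values in event.items()
        if name.startswith(prefix + "_")
    }
    if not fields:
        return []
    # All arrays of one collection must have the same length in a given event.
    lengths = {len(values) for values in fields.values()}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent array lengths for prefix {prefix!r}")
    n = lengths.pop()
    return [{field: values[i] for field, values in fields.items()} for i in range(n)]


# Hypothetical flat event: one array per branch, objects split across columns.
event = {"jet_pt": [50.0, 30.0], "jet_eta": [0.1, -1.2], "met": [12.5]}
jets = regroup(event, "jet")
```

Here `jets` holds one dictionary per jet, each with `pt` and `eta` fields. In
Spark the same regrouping would presumably be expressed on DataFrame columns
instead of Python dictionaries.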
* Performance
* various effects seen as single user alone on a single cluster
* multi-user usage on analytix would distort such measurements too much
* Intel simulations could help understand the performance features, but it
is expensive to implement in the simulation package
* best workload should be close to final so that it makes sense to ask
the Intel team to simulate it
* for now, Illia will share the slides with his team and see what
suggestions, if any, they have
* next meeting:
* April 19, 4 PM CERN time