CMS Big Data Science Project

Name: CMS Big Data Science Project
Start: 2017-04-05T16:00:00+02:00
End: 2017-04-05T17:00:00+02:00

Wednesday 5 Apr 2017, 16:00 → 17:00 Europe/Berlin

Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))

Description

FNAL: WH 11 NW ROC

CERN: 600-R-001

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

https://account.cern.ch/account/Externals/

If that is not possible, you can join the meeting by phone; the call-in numbers are here:

http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone

The meeting ID is hidden below the Videoconference Rooms link, so here it is again:

10502145


* attendance: Illia, Alexey, JimP, Saba, JimK, Bo, Matteo, Viktor, Kacper,
Vagg, Oli

* News
* Dates and Events
* CMS Offline & Computing week just ended; we gave a status report on the
Big Data project
* 170404 - Status of CMS Big Data Project.pdf
* CERN Openlab workshop on Machine Learning and Data Analytics, April
27, CERN
* https://indico.cern.ch/event/627852/
* We will have a talk
* DS@HEP at FNAL, May 8-12, FNAL
* https://indico.fnal.gov/conferenceDisplay.py?confId=13497
* Matteo will give a talk
* HEP Analysis Ecosystem Workshop, May 22-24, Amsterdam
* https://indico.cern.ch/event/613842/timetable/
* “Database Futures” workshop at CERN on May 29th-30th
* https://indico.cern.ch/event/615499/
* to discuss possible future needs in the database area for Run3+4.
Today we see mostly relational and non-relational database
models.
* New trends are Cloud Computing, Big Data, proactive & predictive
performance analysis, …
* We should write an abstract!
* Need to write abstracts for
* ACAT
* JimP and Igor: database backend to NoSQL project (one abstract)
and query language part (2nd abstract)
* "Database Futures" workshop
* anything else?

* Saba:
* reviewing DS abstract for Grace-Hopper conference
* next time further updates
* Matteo
* meeting a couple of days ago, clarified things
* Viktor asked for Panda ntuples, will be moved to EOS at CERN, then into
HDFS
* we could copy it directly into HDFS at CERN
* Matteo will describe the C++ workflow to produce Panda ntuples
* Request from Luca
* publicly accessible task list for everyone -> Vagg is working on this
* Vagg
* EOS direct reading is in progress
* status of reading open data and selection/reduction
* not done this week, will be done soon

* Thrust 1: Panda
* Schema was different for CHEP paper, completely different code base
* From the Spark standpoint, objects need to be put back together because
it is a flat tree
* concluded that no code is needed from Viktor, all functionality is there
* need a small update of spark-root
* JimP wants to change the plotting commands from RDDs to SQL, which is
more efficient because it is a flat tree
* Histogrammar website has a lot of tricks
* goal is to provide Matteo with one example of how to use Histogrammar

* Performance
* various effects seen when running as the only user on a single cluster
* multi-user usage on analytix would distort such measurements too much
* Intel simulations could help understand the performance features, but
implementing the workload in the simulation package is expensive
* the workload should be close to final before it makes sense to ask
the Intel team to simulate it
* for now, Illia will share the slides with his team and see if and
what suggestions they might have

* next meeting:
* April 19, 4 PM CERN time


- 16:00 → 16:05
  News 5m
  
  Speakers: Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
- 16:05 → 16:35
  
  Performance metrics 30m
  
  Speaker: Viktor Khristenko (The University of Iowa (US))
  
  gist for a quick look at Panda (ROOT!!!) file
  
  sparkroot_performance_24032017.pdf
- 16:35 → 17:00
  
  Discussion 25m
