
CMS Big Data Science Project

Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))


CERN: 600-R-001

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If that is not possible, people can join the meeting by phone; call-in numbers are here:

The meeting id is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

* attendance: Illia, Alexey, JimP, Saba, JimK, Bo, Matteo, Viktor, Kacper,
  Vagg, OLI

* News
    * Dates and Events
        * CMS Offline & Computing week just over, we gave status report about
          Big Data
            * 170404 - Status of CMS Big Data Project.pdf
        * CERN Openlab workshop on Machine Learning and Data Analytics, April
          27, CERN
            * We will have a talk
        * DS@HEP at FNAL, May 8-12, FNAL
            * Matteo will give a talk
        * HEP Analysis Ecosystem Workshop, May 22-24, Amsterdam
        * “Database Futures” workshop at CERN on May 29th-30th
            * to discuss possible future needs in the database area for Run3+4.
              Today we see mostly relational and non-relational databases
            * New trends are Cloud Computing, Big Data, proactive & predictive
              performance analysis, …
            * We should write an abstract!
    * Need to write abstracts for
        * ACAT
            * JimP and Igor: database backend to NoSQL project (one abstract)
              and query language part (2nd abstract)
        * "Database Futures" workshop
        * anything else?
* Saba:
    * reviewing DS abstract for Grace-Hopper conference
    * next time further updates
* Matteo
    * meeting a couple of days ago, clarified things
    * Viktor asked for Panda ntuples, will be moved to EOS at CERN, then into
        * we could copy it directly into HDFS at CERN
    * Matteo will describe the C++ workflow to produce Panda ntuples
* Request from Luca
    * publicly accessible task list for everyone -> Vagg is working on this
* Vagg
    * EOS direct reading is in progress
* status of reading open data and selection/reduction
    * not done this week, will be done soon

* Thrust 1: Panda
    * Schema was different for CHEP paper, completely different code base
    * From the Spark standpoint, need to put objects back together because it
      is a flat tree
    * concluded that no code is needed from Viktor, all functionality is there
        * need a small update of spark-root
    * JimP wants to change the plotting commands to SQL from RDDs so that it is
      more efficient, because it is a flat tree
    * Histogrammer web has a lot of tricks
        * goal is to provide Matteo with one example how to use Histogrammer 
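
To illustrate what "putting objects back together" from a flat tree means, here is a minimal sketch in plain Python (no Spark). The branch names (`nMuon`, `Muon_pt`, `Muon_eta`) and the helper `regroup` are hypothetical and chosen only for illustration; they are not the actual Panda schema or spark-root code.

```python
# Hypothetical flat-tree events: every branch is a separate column, and
# per-object quantities are stored as parallel lists (illustrative names only).
flat_events = [
    {"nMuon": 2, "Muon_pt": [41.2, 17.5], "Muon_eta": [0.3, -1.1]},
    {"nMuon": 0, "Muon_pt": [], "Muon_eta": []},
    {"nMuon": 1, "Muon_pt": [55.0], "Muon_eta": [2.0]},
]


def regroup(event, prefix):
    """Zip all branches sharing `prefix` into a list of per-object dicts."""
    # Collect the parallel lists, keyed by the field name after the prefix.
    fields = {
        key[len(prefix):]: values
        for key, values in event.items()
        if key.startswith(prefix)
    }
    # The counter branch (e.g. "nMuon") gives the number of objects.
    count = event["n" + prefix.rstrip("_")]
    return [
        {name: values[i] for name, values in fields.items()}
        for i in range(count)
    ]


# One list of muon objects per event.
muons_per_event = [regroup(event, "Muon_") for event in flat_events]
```

Each event then carries a list of muon records rather than parallel arrays, which is the shape that object-level selections and plotting typically expect.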

* Performance
    * various effects seen as single user alone on a single cluster
    * multi-user usage on analytix would distort such measurements too much
    * Intel simulations could help understand the performance features, though
      it is expensive to implement this in the simulation package
        * best workload should be close to final so that it makes sense to ask
          the Intel team to simulate it
        * for now, Illia will share the slides with his team and see if and
          what suggestions they might have

* next meeting:
    * April 19, 4 PM CERN time
