CMS Big Data Science Project

US/Central
Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
Description

DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East

CERN room:

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If not possible, people can join the meeting by the phone, call-in numbers are here:

The meeting id is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

## 19th Big Data meeting - 161026

* indico: https://indico.cern.ch/event/580051/

* attendance: Matteo, JimP, JimK, Saba, Bo, Alexey, Illia

* conferences/papers/writeups
    * CHEP
        * went well
        * main discussion was AVRO and why we used it
        * another presentation showed the same approach as ours but with ElasticSearch, which increased the size per event by a factor of 2 to 5 because of the indices
        * paper in preparation
    * Grace Hopper
        * went very well
        * talked a lot about the physics and what analysis means
        * after the talk, people came and chatted with Saba; everyone understood what she wanted to convey
        * industry was interested: Oracle and Intel (HPC technical people)
            * a team from Oracle wants to follow up on what we have learned so far
        * no paper
    * SuperComputing 
        * will have a poster with both topics, no proceedings
        * booth demo (planned, but not confirmed yet), will target the NERSC workflow
    * DOE report
        * can use SC poster paper to prepare writeup
    * Saba wants to submit a paper to HPDC, deadline in January -> concentrate on HDF5 and data layouts

* Intel milestones:
    * https://goo.gl/11qqZ2
    * approved by all

* status interactive database
    * short description: instant plots for users on specially prepared data (TB scale)
    * first version hinges on GPUs as optimizer

* discussion about how to continue after CHEP
    * Nhan brought up that analysis will always have to run/rerun parts of the central reconstruction code for analysis-specific purposes; Matteo strongly agrees
    * Proposal is to partition the problem
        * full reconstruction: CMSSW
        * end-user analysis reconstruction: CMSSW
        * data reduction: big data
        * remark: a combined facility (already proposed) would cover this complete use case
    * JimP starts discussion of next steps
        * Problem getting data out of ROOT is the focus
            * we can do the data conversion better (we had a partial solution, others as well) -> we could pull it together into a shared solution
                * ES
            * approaches on the table: 
                * python through root_numpy (can read CMSSW files)
                * PureJDM solution might work
                * libhdfs.so in C
            * some need intermediate files, some could read ROOT files directly out of HDFS
            * skipping the high level middle man -> more defined problem, faster
    * discussion important, we will continue this, first on a google doc: https://goo.gl/Iev618
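The conversion step under discussion, flattening nested per-event records (as tools like root_numpy expose them) into plain columns that formats such as Avro store, can be sketched in pure Python. This is only an illustration: the field names and toy values below are hypothetical, not project data.

```python
# Hypothetical sketch (not project code): turn a list of per-event
# records into per-field columns, the shape columnar big-data
# formats (e.g. Avro, HDF5) want.

def to_columns(events, fields):
    """Turn a list of event dicts into a dict of per-field columns."""
    return {f: [e[f] for e in events] for f in fields}

# Toy events standing in for reconstructed physics objects.
events = [
    {"run": 1, "n_muons": 2, "met": 41.5},
    {"run": 1, "n_muons": 0, "met": 12.3},
]

columns = to_columns(events, ["run", "n_muons", "met"])
print(columns["met"])  # [41.5, 12.3]
```

A shared conversion tool would essentially do this per-field transposition once, at scale, so downstream analyses never touch the intermediate event objects.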

* next meeting, in 2 weeks: 7–11 November 2016
    * CMS Offline & Computing week
    * several other constraints
    * will try to find possible day/time

There are minutes attached to this event.
    • 10:00 – 10:05
      News 5m
      Speakers: Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))

      - both presentations at Grace Hopper and CHEP went well

      - let's discuss next steps and the proposed milestones for Intel