CMS Big Data Science Project

Europe/Berlin
Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
Description

PPD/ Quarium-WH8SW - Wilson Hall 8th fl South West

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If that is not possible, people can join the meeting by phone; call-in numbers are here:

The meeting id is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

## 170322 - Big Data Meeting

* attendance: JimP, JimK, Matteo, Saba, Kacper, Luca, Vagg, Bo, Illia

* news
    * ACAT abstract submission deadline: 29 April 2017
        * https://indico.cern.ch/event/567550/abstracts/
        * JimP wants to submit abstracts about FemtoCode
    * IEEE Big Data December 2017
        * Call for papers is open and will remain open for most of the year
    * CERN Openlab workshops with aim of identifying potential areas of work
      for the next three-year phase of CERN openlab (phase VI)
        * Data Center Technologies and Infrastructures, March 1, agenda:
          http://tinyurl.com/l4ru23o
        * Compute Platforms and Software, March 23, agenda:
          http://tinyurl.com/l3uddp6
        * Machine learning and data analytics, April 27, agenda not yet
          available
            * We will be asked to give a talk about the Intel Big Data project
    * Community White Paper (CWP): A Roadmap for HEP Software and Computing R&D
      for the 2020s
        * main page: http://hepsoftwarefoundation.org/activities/cwp.html
        * working groups:
          http://hepsoftwarefoundation.org/cwp/cwp-working-groups.html
        * Data Analysis and Interpretation WG: google docs:
          http://tinyurl.com/jsxytph
            * Planned workshops
                * CWP discussions at HEP Analysis Ecosystem Retreat, May 22-24
                  (agenda: http://tinyurl.com/lctgf3f)
        * Machine Learning WG: google docs: http://tinyurl.com/m559fmy
            * Planned workshops
                * a CWP session during the IML topical workshop at CERN, March
                  20 - 22, 2017
                * a CWP session (TBC) during DS@HEP 2017, FNAL May 8 - 12, 2017
                    * followed by two days of tutorials at Fermilab
                      (Monday-Tuesday) about machine learning
                    * hosted by Maurizio Pierini
                    * possibly entirely CMS (not sure)
                    * HATS in May as well, not including Spark because of
                      complicated access
        * subscribe to the google groups to stay informed and maybe get
          involved
    * Monthly Intel/Openlab meeting -> Vagg is going to join the meeting and
      report

* progress reports
    * Meetings with Vagg
        * hasn't contacted Matteo yet
            * will be organized for next week when Matteo is at CERN
        * JimP will go through spark-root library with Victor and Matteo
        * meeting with EOS people happened
            * First goal: accessing ROOT files from EOS directly from Spark
        * Matteo:
            * Thrust 1: full analysis
                * use Panda ntuples to do analysis up to plots
            * Thrust 2: data reduction
                * from open data to ntuple
                * does not need the full analysis use case
                * sit down and decide on a selection that makes sense from the
                  physics point of view
            * Panda ntuple
                * previously produced at CERN, now produced at MIT; the CERN
                  production should be fine
                * Panda ntuples are a couple of TB
            * can share infrastructure between the two thrusts
        * Victor:
            * spark-root
            * bugs are being worked on
            * IBM is starting to use it
        * JimP
            * Histogrammar: some bug fixes and new features: KPMG (consulting
              company)
                * added some visualization for categorical data
            * SparkR already gets the parallelism that Histogrammar would
              give you
        * Luca
            * comments on the google doc, will discuss next week at CERN
        * Illia
            * Intel would like to start running a simulator to optimize the
              performance
                * need metrics to feed simulator
            * next week, Victor will show his metrics implementation and
              results from running on the Intel cluster that CERN IT had
              access to
        * Saba
            * one paper was submitted in January, focused on HDF5, data
              layout, and results on NERSC Edison, for the International
              Parallel and Distributed Processing Symposium, IPDPS 2017
                * workshop: High Performance Computation
                * camera ready version of paper was uploaded last night
            * submitted abstract to Grace Hopper -> Focus is on comparing MPI
              and Spark
                * nothing concrete, presentation not until October
            * Saba wants to have a summer student from UIC
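
The remark under JimP about parallelism refers to histogram filling that can be split across partitions and merged afterwards. The following is a minimal sketch of that idea in plain Python; it is not Histogrammar's actual API, and the class `Hist`, the helper `fill_partition`, and the sample values are hypothetical names and data chosen only for illustration.

```python
from dataclasses import dataclass, field
from functools import reduce
from typing import List


@dataclass
class Hist:
    """Fixed-width 1D histogram that can be filled and merged (illustrative only)."""
    nbins: int
    low: float
    high: float
    counts: List[int] = field(default_factory=list)

    def __post_init__(self):
        # Start with empty bins unless counts were passed in explicitly.
        if not self.counts:
            self.counts = [0] * self.nbins

    def fill(self, x: float) -> "Hist":
        # Values outside [low, high) are ignored in this sketch.
        if self.low <= x < self.high:
            width = (self.high - self.low) / self.nbins
            index = min(int((x - self.low) / width), self.nbins - 1)
            self.counts[index] += 1
        return self

    def __add__(self, other: "Hist") -> "Hist":
        # Merging is bin-wise addition, so partial results can be
        # combined in any order.
        merged = [a + b for a, b in zip(self.counts, other.counts)]
        return Hist(self.nbins, self.low, self.high, merged)


def fill_partition(values):
    """Fill one histogram from one partition of the data."""
    h = Hist(nbins=4, low=0.0, high=100.0)
    for v in values:
        h.fill(v)
    return h


# Hypothetical data split into three partitions, as a cluster would see it.
partitions = [[5.0, 30.0, 55.0], [80.0, 99.0, 10.0], [60.0, 150.0]]

# Fill each partition independently, then merge the partial histograms.
total = reduce(lambda a, b: a + b, map(fill_partition, partitions))

# Filling everything in a single pass gives the same bin contents.
single = fill_partition([v for part in partitions for v in part])
print(total.counts, single.counts)
```

Because the merge step is associative and commutative, the per-partition fills could run on separate workers, which is the kind of parallelism the note says SparkR already provides.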

* ROOT files from EOS read in Spark
    * 2 solutions
        * access data in Spark via local mount point of EOS (POSIX filesystem
          to EOS, FUSE mount); there are limitations
            * solution should already exist and is not the preferred solution
        * connect to EOS directly (without FUSE mount)
            * two possibilities
                * write new Java code following spark-root resurrection of old
                  ROOT-JAVA implementation
                * or use C++ code in Java (JNI) (did I get this right?)
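
From the caller's side, the two solutions above differ mainly in how a file is addressed: as an ordinary POSIX path under the FUSE mount, or as an XRootD URL pointing at the EOS server. The sketch below only builds the two kinds of address; it does not open any file. The function names, the host `eoscms.cern.ch`, and the file path are assumptions for illustration and do not come from the minutes.

```python
def via_fuse(eos_path: str) -> str:
    """Option 1: address the file as a local POSIX path under the FUSE mount.

    Assumes EOS is mounted at /eos on every node that reads the file.
    """
    if not eos_path.startswith("/eos/"):
        raise ValueError(f"not an EOS path: {eos_path}")
    return "file://" + eos_path


def via_xrootd(eos_path: str, host: str = "eoscms.cern.ch") -> str:
    """Option 2: address the file directly on the EOS server, no mount needed.

    XRootD URLs have the form root://<host>//<absolute path>; the double
    slash after the host is intentional.
    """
    if not eos_path.startswith("/eos/"):
        raise ValueError(f"not an EOS path: {eos_path}")
    return f"root://{host}/{eos_path}"


# Hypothetical file path, for illustration only.
example = "/eos/cms/store/user/example/ntuple.root"
print(via_fuse(example))
print(via_xrootd(example))
```

Either string is what would eventually be handed to a reader in Spark. Whether that reader can resolve the `root://` form without a mount is the open point the two possibilities above are meant to address.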

* plan:
    * have data reduction facility physics code ready to be used by next
      meeting

* next meeting
    * April 5th

    • 16:00 16:05
      News 5m
      Speakers: Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))


    • 16:05 17:00
      Discussion 55m