CMS Big Data Science Project

US/Central
Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
Description

DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East

CERN room:

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If not possible, people can join the meeting by the phone, call-in numbers are here:

The meeting id is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

## 19th Big Data meeting - 161026

* indico: https://indico.cern.ch/event/580051/

* attendance: Matteo, JimP, JimK, Saba, Bo, Alexey, Illia

* conferences/papers/writeups
    * CHEP
        * went well
        * main discussion was AVRO and why we used it
        * another presentation showed the same approach as ours but with ElasticSearch, which increased the size per event by a factor of 2 to 5 because of the indices
        * paper in preparation
    * Grace Hopper
        * went very well
        * talked a lot about the physics and what analysis means
        * after the talk, people came and chatted with Saba; everyone understood what she wanted to convey
        * industry was interested: Oracle and Intel (HPC technical people)
            * a team from Oracle wants to follow up on what we have learned so far
        * no paper
    * SuperComputing 
        * will have a poster with both topics, no proceedings
        * booth demo (planned, but not confirmed yet), will target the NERSC workflow
    * DOE report
        * can use SC poster paper to prepare writeup
    * Saba wants to submit a paper to HPDC, deadline in January -> concentrate on HDF5 and data layouts

* Intel milestones:
    * https://goo.gl/11qqZ2
    * approved by all

* status interactive database
    * short description: instant plots for users on specially prepared data (TB scale)
    * first version hinges on GPUs as optimizer

* discussion about how to continue after CHEP
    * Nhan brought up that analysis will always have to run/rerun parts of the central reconstruction code for analysis-specific purposes; Matteo strongly agrees
    * Proposal is to partition the problem
        * full reconstruction: CMSSW
        * end-user analysis reconstruction: CMSSW
        * data reduction: big data
        * remark: a combined facility (already proposed) would cover this complete use case
    * JimP starts discussion of next steps
        * Problem getting data out of ROOT is the focus
            * we can do the data conversion better (we had a partial solution, others as well) -> we could pull it together into a shared solution
                * ES
            * approaches on the table: 
                * python through root_numpy (can read CMSSW files)
                * PureJDM solution might work
                * libhdfs.so in C
            * some need intermediate files, some could read ROOT files directly out of HDFS
            * skipping the high level middle man -> more defined problem, faster
    * discussion important, we will continue this, first on a google doc: https://goo.gl/Iev618
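The conversion step under discussion, flattening nested per-event records (as tools like root_numpy expose them) into plain columns that formats such as Avro store, can be sketched in pure Python. This is only an illustration: the field names and toy values below are hypothetical, not project data.

```python
# Hypothetical sketch (not project code): turn a list of per-event
# records into per-field columns, the shape columnar big-data
# formats (e.g. Avro, HDF5) want.

def to_columns(events, fields):
    """Turn a list of event dicts into a dict of per-field columns."""
    return {f: [e[f] for e in events] for f in fields}

# Toy events standing in for reconstructed physics objects.
events = [
    {"run": 1, "n_muons": 2, "met": 41.5},
    {"run": 1, "n_muons": 0, "met": 12.3},
]

columns = to_columns(events, ["run", "n_muons", "met"])
print(columns["met"])  # [41.5, 12.3]
```

A shared conversion tool would essentially do this per-field transposition once, at scale, so downstream analyses never touch the intermediate event objects.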

* next meeting, in 2 weeks: 7–11 November 2016
    * CMS Offline & Computing week
    * several other constraints
    * will try to find possible day/time

There are minutes attached to this event.
    • 10:00 – 10:05
      News 5m
      Speakers: Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))

      - both presentations at Grace Hopper and CHEP went well

      - let's discuss next steps and the proposed milestones for Intel