CMS Big Data Science Project
DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East
CERN room:
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If not possible, people can join the meeting by phone; call-in numbers are here:
The meeting ID is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
## 19th Big Data meeting - 161026
* indico: https://indico.cern.ch/event/580051/
* attendance: Matteo, JimP, JimK, Saba, Bo, Alexey, Illia
* conferences/papers/writeups
* CHEP
* went well
* main discussion was AVRO and why we used it
    * Another presentation showed the same approach as ours but with ElasticSearch, where the indices increased the size per event by a factor of 2 to 5
* paper in preparation
* Grace Hopper
* went very well
* talking a lot about the physics and what analysis means
* after the talk, people came and chatted with Saba, everyone understood what Saba wanted to convey in her talk
* industry was interested: Oracle and Intel (HPC technical people)
      * A team from Oracle wanted to follow up on what we have learned so far
* no paper
* SuperComputing
* will have a poster with both topics, no proceedings
    * booth demo (planned, but not confirmed yet), will target the NERSC workflow
* DOE report
* can use SC poster paper to prepare writeup
* Saba wants to submit a paper to HPDC, deadline in January -> concentrate on HDF5 and data layouts
* Intel milestones:
* https://goo.gl/11qqZ2
* approved by all
* status interactive database
* short description: instant plots to user on specially prepared data (on the scale of TB)
  * first version hinges on GPUs as the optimizer
* discussion about how to continue after CHEP
  * Nhan brought up that analyses will always have to run/rerun parts of the central reconstruction code for analysis-specific purposes; Matteo strongly agrees
* Proposal is to partition the problem
* full reconstruction: CMSSW
* end-user analysis reconstruction: CMSSW
* data reduction: big data
* remark: a combined facility (already proposed) would cover this complete use case
* JimP starts discussion of next steps
* Problem getting data out of ROOT is the focus
* we can do the data conversion better (we had a partial solution, others as well) -> we could pull it together into a shared solution
* ES
* approaches on the table:
* python through root_numpy (can read CMSSW files)
      * a pure-JVM solution might work
* libhdfs.so in C
      * some need intermediate files, some could read ROOT files directly out of HDFS
      * skipping the high-level middleman -> a more sharply defined problem, faster
* discussion important, we will continue this, first on a google doc: https://goo.gl/Iev618
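  * The conversion approaches above all reduce to the same step: transposing row-wise ROOT events into columnar arrays that big-data tools can consume. A minimal stdlib-only sketch of that transposition, with hypothetical branch names and toy values standing in for records read from a ROOT tree (not the actual CMSSW schema):

    ```python
    # Illustrative only: "events" mimics row-wise records as read from a ROOT
    # tree; branch names ("pt", "eta") and values are made up for this sketch.
    from array import array

    def events_to_columns(events, branches):
        """Transpose per-event dicts into one flat double array per branch."""
        return {b: array("d", (e[b] for e in events)) for b in branches}

    # Toy events standing in for a ROOT tree's rows:
    events = [
        {"pt": 41.2, "eta": -0.7},
        {"pt": 17.9, "eta": 1.3},
        {"pt": 63.0, "eta": 0.2},
    ]
    columns = events_to_columns(events, ["pt", "eta"])
    ```

    In practice this step would be done by root_numpy, a pure-JVM reader, or libhdfs-based C code as listed above; the sketch only shows the row-to-column transposition they share.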
* next meeting, in 2 weeks: 7-11 November 2016
* CMS Offline & Computing week
* several other constraints
* will try to find possible day/time