CMS Big Data Science Project
FNAL room: DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East
Instructions to create a lightweight CERN account to join the meeting via Vidyo:
If that is not possible, people can join the meeting by phone; the call-in numbers are here:
The meeting ID is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
## 170308 - 25th Big Data Meeting
* attendance: Alexey, Igor, Matteo, Kacper, Luca, Victor, Ian, Illia, Luca, Vaggelis, Saba
* news
* Vaggelis: new CERN fellow ➜ Welcome!
* first task: reading ROOT files in Spark from EOS ➜ see the read sketch after this list
* Luca will setup meeting on Tuesday, March 14, afternoon, with Vaggelis, OLI, Kacper and himself
* OLI will setup meeting with Vaggelis, Matteo, Jim, Kacper, Luca next week (maybe ~Thursday) to get to know each other
* Evangelos still needs to be added to Google Groups, Slack, etc.
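
A minimal PySpark sketch of that first task, assuming the DIANA-HEP spark-root package; the Maven coordinates, version, and EOS path below are placeholders, not tested values:

```python
from pyspark.sql import SparkSession

# Build a session that pulls spark-root from Maven Central; the
# coordinates/version are assumptions, check the spark-root releases.
spark = (SparkSession.builder
         .appName("eos-root-read")
         .config("spark.jars.packages", "org.dianahep:spark-root_2.11:0.1.15")
         .getOrCreate())

# The EOS path below is a placeholder; reading via the root:// protocol
# additionally needs an XRootD-aware Hadoop connector on the classpath.
df = (spark.read
      .format("org.dianahep.sparkroot")
      .load("root://eosuser.cern.ch//eos/user/v/vaggelis/sample.root"))

df.printSchema()
print(df.count())
```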
* Victor
* results from Intel cluster
* polishing results, then maybe report here
* useful, a good investment in characterizing performance on Spark
* working on making it more universal
* to look into further: I/O performance characterization
* IBM got started
* small introduction
* difficult in the beginning
* they are using their own storage system
* couldn't download dependencies from Maven at first
* resolved now ➜ see the dependency sketch after this list
* bottom line: able to read ROOT files
* Jim: Have we seen a DQM ROOT file? ➜ correction: we will use real data files; bottom line, it's going to be a TTree (MINIAOD, AOD, or RECO data tier)
* no new technology needs to be developed for now
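
For the Maven issue above, one workaround sketch: fetch the jars once on a machine with network access and ship them explicitly via `spark.jars`, which, unlike `spark.jars.packages`, needs no access to Maven Central from the cluster. The jar paths and versions are made up:

```python
from pyspark.sql import SparkSession

# Ship pre-downloaded jars instead of resolving them from Maven Central
# (the problem the IBM cluster hit); paths and versions are placeholders.
spark = (SparkSession.builder
         .appName("root-read-offline-deps")
         .config("spark.jars",
                 "/opt/jars/spark-root_2.11-0.1.15.jar,"
                 "/opt/jars/root4j-0.1.6.jar")
         .getOrCreate())

df = spark.read.format("org.dianahep.sparkroot").load("/data/test.root")
df.show(5)
```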
* Striped Event Project
* presentation: http://tinyurl.com/jxl3t52
* project goal
* reduce time to physics, especially speeding up iterations
* not tackling general computation like spark
* tackling aggregation, selection, ...
* tackling interactivity and/or high turnaround
* hardware:
* CouchBase: old farm hardware, dataset uses 1.9 TB
* big portion can be in memory and immediately available
* nginx web cache on SSD
* client: old development machine with 16 cores
* workers should not be transient, but permanent on their own hardware
* overall, 2 layers of very fast data cache ➜ takes advantage of stripes being easily cacheable (see the sketch below)
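
A hypothetical illustration of the two cache layers: try the nginx web cache first and fall back to CouchBase on a miss. The host names, bucket name, and stripe key scheme are invented, and the Couchbase 2.x Python SDK is assumed:

```python
import requests
from couchbase.bucket import Bucket   # Couchbase Python SDK (2.x API)

# Host names, bucket name, and the key layout below are all made up.
NGINX = "http://striped-cache.fnal.gov"
bucket = Bucket("couchbase://striped-db.fnal.gov/stripes")

def fetch_stripe(dataset, column, group):
    key = f"{dataset}/{column}/{group}"          # assumed key layout
    resp = requests.get(f"{NGINX}/{key}")
    if resp.status_code == 200:                  # served from the SSD web cache
        return resp.content
    return bucket.get(key).value                 # cold read from CouchBase
```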
* events per group
* using 1,000 for most datasets
* 10,000 seems to be better, based on a small investigation ➜ see the striping sketch below
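
A minimal striping sketch with the 10,000-events-per-group size, using numpy stand-in data:

```python
import numpy as np

# Cut one column of an event array into fixed-size groups
# (10,000 events each, the size that looked best above).
GROUP_SIZE = 10_000

def make_stripes(column_data):
    """Yield (group_index, stripe) pairs for one column."""
    for i in range(0, len(column_data), GROUP_SIZE):
        yield i // GROUP_SIZE, column_data[i:i + GROUP_SIZE]

pt = np.random.exponential(30.0, size=1_000_000)  # fake pT column, 1M events
stripes = dict(make_stripes(pt))                  # 100 stripes of 10k events
```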
* performance
* cached 1M events: 1.5s
* not cached: ~factor 10 slower
* features:
* one sample was efficiently stored with 10,000 events per group
* other sample with 1,000 events per group
* seeing a difference in performance
* right now, we're in the MHz range (1M events in 1.5 s ≈ 0.7 MHz)
* we can get to the GHz range by introducing a memory cache in the workers and making them persistent
* histograms
* dynamically built
* you can watch the data being filled, stop, and implement changes to correct problems (see the sketch below)
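
A small sketch of the incremental filling idea, with numpy stand-in data (the bin range is assumed):

```python
import numpy as np

# Accumulate a fixed-bin histogram stripe by stripe, so partial results
# are inspectable (and the job can be stopped and corrected) before all
# data has been processed.
bins = np.linspace(0.0, 200.0, 101)          # 100 bins; the range is assumed
counts = np.zeros(len(bins) - 1)

fake_stripes = np.random.exponential(30.0, size=(100, 10_000))  # stand-in data
for stripe in fake_stripes:
    counts += np.histogram(stripe, bins=bins)[0]
    # `counts` is a valid partial histogram after every stripe

print(int(counts.sum()))                     # 1,000,000 entries at the end
```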
* deployment
* modular
* many deployment variations possible within the same data center and across centers
* next steps
* Jim is working on making the workers persistent and adding a memory cache ➜ see the caching sketch after this list
* currently everything is in Python; Jim is working on something more performant and suitable
* looking at different backends and also the user laptop analysis use case
* replace the lowest two layers with local disk and keep the API the same (running remote and centralized, or on one's own local computing, should look the same)
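
One possible shape for the persistent-worker memory cache, sketched with functools.lru_cache; the fetch call and the cache size are placeholders:

```python
from functools import lru_cache

def fetch_stripe_bytes(dataset: str, column: str, group: int) -> bytes:
    # placeholder for the real backend fetch (nginx web cache / CouchBase)
    return b"..."

# Keep the worker process alive and memoize stripe fetches, so repeated
# iterations over the same dataset hit RAM instead of the cache layers.
@lru_cache(maxsize=4096)        # cache size is a made-up tuning knob
def get_stripe(dataset: str, column: str, group: int) -> bytes:
    return fetch_stripe_bytes(dataset, column, group)
```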
* virtual datasets
* recalculate parts of a dataset
* hide the combination of two datasets into one virtual dataset on the server ➜ allows overriding parts of the data (see the sketch below)
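
A hypothetical sketch of the virtual-dataset override, using a plain ChainMap overlay; the key layout and values are invented:

```python
from collections import ChainMap

# A recalculated patch is laid over the original dataset, so lookups
# transparently prefer the overridden groups; keys are (column, group).
base  = {("pt", g): f"stripe-v1-{g}" for g in range(100)}
patch = {("pt", g): f"stripe-v2-{g}" for g in (7, 8)}   # groups recomputed

virtual = ChainMap(patch, base)     # patch wins, base fills the rest
print(virtual[("pt", 7)])           # -> stripe-v2-7 (overridden)
print(virtual[("pt", 9)])           # -> stripe-v1-9 (original)
```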
* skimming
* incorporate skimming tools into the design
* Saba
* collaborative work with NERSC
* improve the HDF5-to-Spark read process ➜ see the HDF5 sketch after this list
* Saba gave a presentation to NERSC on details
* received a couple of suggestions, following up
* moved from Edison to Cori Phase 1 (Haswell architecture): significant speedup (5x)
* first fine-tune on Cori Phase 1 before thinking about Cori Phase 2 (KNL)
* submitted paper to workshop in January, got accepted, camera-ready version of the paper due March 22 ➜ will be presented in May
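
One common pattern for an HDF5-to-Spark read path, sketched with h5py; the file name, dataset name, and chunk size are made up, and this is not necessarily the NERSC implementation:

```python
import h5py
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdf5-read").getOrCreate()
sc = spark.sparkContext

# Placeholders: file path, dataset name, and rows-per-task chunk size.
PATH, DSET, CHUNK = "/global/cscratch1/events.h5", "events/pt", 100_000

# Parallelize over row ranges; each executor opens the file and reads
# only its own slice.
with h5py.File(PATH, "r") as f:
    n = f[DSET].shape[0]
ranges = [(i, min(i + CHUNK, n)) for i in range(0, n, CHUNK)]

def read_slice(bounds):
    lo, hi = bounds
    with h5py.File(PATH, "r") as f:     # each task opens the file itself
        return np.asarray(f[DSET][lo:hi])

partial_sums = sc.parallelize(ranges, len(ranges)).map(
    lambda b: float(read_slice(b).sum()))
print(partial_sums.sum())               # toy aggregation over all events
```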
* Next meeting: March 22