CMS Big Data Science Project
FNAL room: DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East
Instructions to create a lightweight CERN account to join the meeting via Vidyo:
If that is not possible, people can join the meeting by phone; the call-in numbers are here:
The meeting ID is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
## 170308 - 25th Big Data Meeting
* attendance: Alexey, Igor, Matteo, Kacper, Luca, Victor, Ian, Illia, Luca, Vaggelis, Saba
* news
* Vaggelis: new CERN fellow ➜ Welcome!
* first task: reading ROOT files in Spark from EOS ➜ see the read sketch after this list
* Luca will setup meeting on Tuesday, March 14, afternoon, with Vaggelis, OLI, Kacper and himself
* OLI will setup meeting with Vaggelis, Matteo, Jim, Kacper, Luca next week (maybe ~Thursday) to get to know each other
* Evangelos still needs to be added to Google Groups, Slack, etc.
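
A minimal PySpark sketch of that first task, assuming the DIANA-HEP spark-root package; the Maven coordinates, version, and EOS path below are placeholders, not tested values:

```python
from pyspark.sql import SparkSession

# Build a session that pulls spark-root from Maven Central; the
# coordinates/version are assumptions, check the spark-root releases.
spark = (SparkSession.builder
         .appName("eos-root-read")
         .config("spark.jars.packages", "org.dianahep:spark-root_2.11:0.1.15")
         .getOrCreate())

# The EOS path below is a placeholder; reading via the root:// protocol
# additionally needs an XRootD-aware Hadoop connector on the classpath.
df = (spark.read
      .format("org.dianahep.sparkroot")
      .load("root://eosuser.cern.ch//eos/user/v/vaggelis/sample.root"))

df.printSchema()
print(df.count())
```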
* Victor
* results from Intel cluster
* polishing results, then maybe report here
* useful, a good investment in characterizing performance on Spark
* working on making it more universal
* to look into further: I/O performance characterization
* IBM got started
* small introduction
* difficult in the beginning
* they are using their own storage system
* couldn't download dependencies from Maven at first
* resolved now ➜ see the dependency sketch after this list
* bottom line: able to read ROOT files
* Jim: Have we seen a DQM ROOT file? ➜ correction: we will use real data files; bottom line, it's going to be a TTree (MINIAOD, AOD, or RECO data tier)
* no new technology needs to be developed for now
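
For the Maven issue above, one workaround sketch: fetch the jars once on a machine with network access and ship them explicitly via `spark.jars`, which, unlike `spark.jars.packages`, needs no access to Maven Central from the cluster. The jar paths and versions are made up:

```python
from pyspark.sql import SparkSession

# Ship pre-downloaded jars instead of resolving them from Maven Central
# (the problem the IBM cluster hit); paths and versions are placeholders.
spark = (SparkSession.builder
         .appName("root-read-offline-deps")
         .config("spark.jars",
                 "/opt/jars/spark-root_2.11-0.1.15.jar,"
                 "/opt/jars/root4j-0.1.6.jar")
         .getOrCreate())

df = spark.read.format("org.dianahep.sparkroot").load("/data/test.root")
df.show(5)
```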
* Striped Event Project
* presentation: http://tinyurl.com/jxl3t52
* project goal
* reduce time to physics, especially speeding up iterations
* not tackling general computation like spark
* tackling aggregation, selection, ...
* tackling interactivity and/or high turnaround
* hardware:
* CouchBase: old farm hardware, dataset uses 1.9 TB
* big portion can be in memory and immediately available
* nginx web cache on SSD
* client: old development machine with 16 cores
* workers should not be transient, but permanent on their own hardware
* overall, 2 layers of very fast data cache ➜ takes advantage of stripes being easily cacheable (see the sketch below)
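
A hypothetical illustration of the two cache layers: try the nginx web cache first and fall back to CouchBase on a miss. The host names, bucket name, and stripe key scheme are invented, and the Couchbase 2.x Python SDK is assumed:

```python
import requests
from couchbase.bucket import Bucket   # Couchbase Python SDK (2.x API)

# Host names, bucket name, and the key layout below are all made up.
NGINX = "http://striped-cache.fnal.gov"
bucket = Bucket("couchbase://striped-db.fnal.gov/stripes")

def fetch_stripe(dataset, column, group):
    key = f"{dataset}/{column}/{group}"          # assumed key layout
    resp = requests.get(f"{NGINX}/{key}")
    if resp.status_code == 200:                  # served from the SSD web cache
        return resp.content
    return bucket.get(key).value                 # cold read from CouchBase
```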
* events per group
* using 1,000 for most datasets
* 10,000 seems to be better, based on a small investigation ➜ see the striping sketch below
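
A minimal striping sketch with the 10,000-events-per-group size, using numpy stand-in data:

```python
import numpy as np

# Cut one column of an event array into fixed-size groups
# (10,000 events each, the size that looked best above).
GROUP_SIZE = 10_000

def make_stripes(column_data):
    """Yield (group_index, stripe) pairs for one column."""
    for i in range(0, len(column_data), GROUP_SIZE):
        yield i // GROUP_SIZE, column_data[i:i + GROUP_SIZE]

pt = np.random.exponential(30.0, size=1_000_000)  # fake pT column, 1M events
stripes = dict(make_stripes(pt))                  # 100 stripes of 10k events
```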
* performance
* cached 1M events: 1.5s
* not cached: ~factor 10 slower
* features:
* one sample was efficiently stored with 10,000 events per group
* other sample with 1,000 events per group
* seeing a difference in performance
* right now, we're in the MHz range (1M events in 1.5 s ≈ 0.7 MHz)
* we can get to the GHz range by introducing a memory cache in the workers and making them persistent
* histograms
* dynamically built
* you can watch the data being filled, stop, and implement changes to correct problems (see the sketch below)
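
A small sketch of the incremental filling idea, with numpy stand-in data (the bin range is assumed):

```python
import numpy as np

# Accumulate a fixed-bin histogram stripe by stripe, so partial results
# are inspectable (and the job can be stopped and corrected) before all
# data has been processed.
bins = np.linspace(0.0, 200.0, 101)          # 100 bins; the range is assumed
counts = np.zeros(len(bins) - 1)

fake_stripes = np.random.exponential(30.0, size=(100, 10_000))  # stand-in data
for stripe in fake_stripes:
    counts += np.histogram(stripe, bins=bins)[0]
    # `counts` is a valid partial histogram after every stripe

print(int(counts.sum()))                     # 1,000,000 entries at the end
```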
* deployment
* modular
* many deployment variations possible within the same data center and across centers
* next steps
* Jim is working on making the workers persistent and adding a memory cache ➜ see the caching sketch after this list
* currently everything is in Python; Jim is working on something more performant and suitable
* looking at different backends and also the user laptop analysis use case
* replace the lowest two layers with local disk and keep the API the same (running remote and centralized, or on one's own local computing, should look the same)
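
One possible shape for the persistent-worker memory cache, sketched with functools.lru_cache; the fetch call and the cache size are placeholders:

```python
from functools import lru_cache

def fetch_stripe_bytes(dataset: str, column: str, group: int) -> bytes:
    # placeholder for the real backend fetch (nginx web cache / CouchBase)
    return b"..."

# Keep the worker process alive and memoize stripe fetches, so repeated
# iterations over the same dataset hit RAM instead of the cache layers.
@lru_cache(maxsize=4096)        # cache size is a made-up tuning knob
def get_stripe(dataset: str, column: str, group: int) -> bytes:
    return fetch_stripe_bytes(dataset, column, group)
```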
* virtual datasets
* recalculate parts of a dataset
* hide the combination of two datasets into one virtual dataset on the server ➜ allows overriding parts of the data (see the sketch below)
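
A hypothetical sketch of the virtual-dataset override, using a plain ChainMap overlay; the key layout and values are invented:

```python
from collections import ChainMap

# A recalculated patch is laid over the original dataset, so lookups
# transparently prefer the overridden groups; keys are (column, group).
base  = {("pt", g): f"stripe-v1-{g}" for g in range(100)}
patch = {("pt", g): f"stripe-v2-{g}" for g in (7, 8)}   # groups recomputed

virtual = ChainMap(patch, base)     # patch wins, base fills the rest
print(virtual[("pt", 7)])           # -> stripe-v2-7 (overridden)
print(virtual[("pt", 9)])           # -> stripe-v1-9 (original)
```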
* skimming
* incorporate skimming tools into the design
* Saba
* collaborative work with NERSC
* improve the HDF5-to-Spark read process ➜ see the HDF5 sketch after this list
* Saba gave a presentation to NERSC on details
* received a couple of suggestions, following up
* moved from Edison to Cori Phase 1 (Haswell architecture): significant speedup (5x)
* first fine-tune on Cori Phase 1 before thinking about Cori Phase 2 (KNL)
* submitted paper to workshop in January, got accepted, camera-ready version of the paper due March 22 ➜ will be presented in May
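
One common pattern for an HDF5-to-Spark read path, sketched with h5py; the file name, dataset name, and chunk size are made up, and this is not necessarily the NERSC implementation:

```python
import h5py
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdf5-read").getOrCreate()
sc = spark.sparkContext

# Placeholders: file path, dataset name, and rows-per-task chunk size.
PATH, DSET, CHUNK = "/global/cscratch1/events.h5", "events/pt", 100_000

# Parallelize over row ranges; each executor opens the file and reads
# only its own slice.
with h5py.File(PATH, "r") as f:
    n = f[DSET].shape[0]
ranges = [(i, min(i + CHUNK, n)) for i in range(0, n, CHUNK)]

def read_slice(bounds):
    lo, hi = bounds
    with h5py.File(PATH, "r") as f:     # each task opens the file itself
        return np.asarray(f[DSET][lo:hi])

partial_sums = sc.parallelize(ranges, len(ranges)).map(
    lambda b: float(read_slice(b).sum()))
print(partial_sums.sum())               # toy aggregation over all events
```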
* Next meeting: March 22