CMS Big Data Science Project
FNAL room: DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East
CERN room:
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If that is not possible, people can join the meeting by phone; call-in numbers are here:
The meeting ID is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
## Meeting 160803
* attendance: Matteo, Alexey, Saba, JimK, Igor, Kacper, Luca, Bo, Illia, Cristina
* news
* Introducing third thrust: NoSQL
* Igor Mandrichenko will start investigating NoSQL databases and how our use case could be realized using them
* We need to get him up to speed on the use case, help him get started with small test samples and show him how the analysis code works
* Poster for SC
* Was submitted in time on Friday, July 22nd, but in a very rudimentary form
* It is not clear if we will get the chance to update it
* In general, we need to be better prepared, which means starting the preparation earlier
* To all: if you would like to submit abstracts or papers to conferences, please let us all know as early as possible so that we can help
* CHEP
* CHEP starts October 9th
* By then we need to have a paper written that describes the use case realization in Spark and presents performance comparisons
* The text of the paper will also be used to write a report for DOE
* Should we start writing the paper? Overleaf? Yes
* Mailing list
* Is the Slack forum enough to reach everyone?
* We could have a Google Group in addition. Opinions?
* goals:
* Princeton workflow
* progress on full scale test ➜ we can get to plots
* polishing the code
* 2 things left, working with Jim and Alexey
* running on smallest and largest sample
* work on the plotting while Alexey is at FNAL
* last test: persisted the input file in Spark; the first run takes 3 min, subsequent runs 14 seconds (see the caching sketch after this list)
* have to complete all samples
* size of cluster: 4 service nodes and 6 worker nodes (138 virtual cores with hyper-threading enabled, 720 GB memory total)
* understand what to do for the next two weeks
* writing documentation that enables people to run the workflow is becoming crucial
* plan to complete it by the next meeting
* documentation on how to use a Jupyter notebook in a local browser and how to connect it through SSH tunnels to the Princeton Spark instance
* a Spark-enabled Jupyter notebook is running on the cluster
* an SSH tunnel connects to it
* the local browser displays the notebook
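A minimal sketch of the setup and caching behavior described above, with all names as placeholders (host, port, and input path are not the actual Princeton settings): forward a local port over SSH, e.g. `ssh -N -L 8888:localhost:8888 user@spark-cluster.princeton.edu`, open `http://localhost:8888` in the local browser, and inside the notebook persist the input so that only the first action pays the full read cost:

```python
# PySpark sketch, assuming Spark 2.x; the input path and storage level
# are illustrative placeholders, not the actual Princeton configuration.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-sketch").getOrCreate()

# Placeholder input; the real workflow reads the converted samples
df = spark.read.load("hdfs:///path/to/converted/sample")

df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # first action reads from storage and fills the cache (~3 min above)
df.count()  # later actions are served from the persisted copy (~14 s above)
```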
* NERSC
* Reader would be different
* but then the Scala analysis code is the same and can be reused
* will use it from GitHub
* ROOT files have to be converted to HDF5 files
* not yet ready to transfer all the files to NERSC
* read HDF5 files directly into the Spark program
* working on reading HDF5 into Spark
* many things are inconvenient or simply not working
* we have exhausted what the existing Spark-HDF5 library can do
* it is not suited for our purposes
* we have to write our own reader function (see the sketch after this list)
* figuring out which layout is better suited for Spark
* goal for next meeting: write reader function
* scaling plan: ~1000 cores
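Since the existing Spark-HDF5 library turned out not to fit, here is a hedged sketch of what a hand-written reader could look like; the dataset names, file list, and flat per-event layout are illustrative assumptions, not the agreed layout:

```python
# Sketch of a custom HDF5 -> Spark reader (PySpark + h5py), assuming
# h5py is installed on the workers and each file stores flat,
# equal-length arrays per quantity; all names below are hypothetical.
import h5py
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("hdf5-reader-sketch").getOrCreate()

def read_hdf5(path):
    """Open one HDF5 file on a worker and yield one Row per event."""
    with h5py.File(path, "r") as f:
        pt = f["/events/muon_pt"][:]   # hypothetical dataset names
        eta = f["/events/muon_eta"][:]
        for i in range(len(pt)):
            yield Row(muon_pt=float(pt[i]), muon_eta=float(eta[i]))

paths = ["sample_000.h5", "sample_001.h5"]  # placeholder file list
rows = spark.sparkContext.parallelize(paths).flatMap(read_hdf5)
df = spark.createDataFrame(rows)
df.printSchema()
```

Whether rows are built per event like this or per file as columnar chunks is exactly the layout question above.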
* NoSQL
* question whether a database approach is the right one for the use case
* Igor: not wanting to implement it if it does not make sense
* discussion: it is then up to him to determine that
* Matteo will forward the use case document
* CERN
* after the documentation is done, we will work on running the same workflow at CERN
* inputs to CERN
* ROOT-to-Avro conversion on the fly
* storing the Avro files on HDFS (see the sketch below)
* code:
* can use the same code
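As a hedged illustration of the Avro-on-HDFS input path: once the converted files sit on HDFS they can be read with the spark-avro package, and the reused analysis code then runs on the resulting DataFrame. The HDFS path is a placeholder and the package coordinates depend on the Spark version deployed at CERN:

```python
# Sketch: reading converted Avro files from HDFS (PySpark, assuming
# Spark 2.x launched with the spark-avro package, e.g.
# --packages com.databricks:spark-avro_2.11:3.0.0); path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-sketch").getOrCreate()

df = spark.read.format("com.databricks.spark.avro").load("hdfs:///cms/converted/*.avro")
df.printSchema()  # the same analysis code can then run on this DataFrame
```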
* after CHEP
* right now we rely on custom input files
* eventually we want to read the central CMS data directly
* redesigning the workflow completely
* right now we are optimizing the bookkeeping needed when running the ROOT workflow
* replace that step so that we read official CMS files within CMSSW
* we need to ask ourselves whether we want to just adapt the workflow or completely redesign it
* the format question has to be solved: do we need an intermediate step?
* right now, we cannot read MINIAOD directly
* there is some C++ code that cannot be easily converted
* end goal:
* at some point, we have to discuss whether Spark is the right technology
* ROOT workflow performance numbers are not yet done; we are waiting for the documentation because we want to compare apples to apples