CMS Big Data Science Project
US/Central
Description
DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East
CERN room:
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If not possible, people can join the meeting by phone, call-in numbers are here:
The meeting id is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
- Reminder: first milestone
- load BACON ntuples in hadoop+spark and produce a plot by the beginning of April
- broke the analysis up into 2 pieces
- skimming
- getting the data out of root (Jim)
- 1st version: json files with avro schema
- loading data into hadoop+spark
- Alexey
- used test instance at Princeton, used local file system
- convert json and schema into avro file, loaded file into hadoop
- Saba
- loaded into HDFS on local development cluster using json files
- reported some problems (infinities, NaNs; also the parser complaining about braces between two records)
- every line is json ➜ load line by line instead of the whole file content into json parser
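The line-by-line fix can be sketched with the standard library alone (the field names and the finiteness filtering are assumptions for illustration):

```python
import json
import math

def read_events(path):
    """Read a skim file where each line is one JSON record.

    Feeding the whole file to json.loads fails (the parser complains
    about braces between two records) because the file is a sequence
    of JSON objects, not a single JSON value; parse it line by line.
    """
    events = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # json.loads accepts the non-standard tokens NaN/Infinity;
            # drop records containing them so downstream tools don't choke.
            event = json.loads(line)
            if any(isinstance(v, float) and not math.isfinite(v)
                   for v in event.values()):
                continue
            events.append(event)
    return events
```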
- Cristina and Jim
- stand alone mode
- took json file, without converting it into avro, using SparkSQL data frames
- convert SparkSQL data frames into objects (very inefficient, but good to start)
- gave up on SparkSQL direct analysis
- the BACON TTree is a complex structure, and SQL is inconvenient for handling it
- python program to filter events ➜ filtered events then can be plotted in python
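The filter step they ended up with can be sketched in plain Python; the field name and threshold here are placeholders, not the actual BACON selection:

```python
def skim(events, min_pt=30.0):
    """Keep events passing a simple selection.

    `events` is a list of dicts, e.g. rows pulled out of a SparkSQL
    DataFrame with df.collect(); the pt field and the 30 GeV cut are
    assumptions standing in for the real selection.
    """
    return [e for e in events if e.get("pt", 0.0) > min_pt]
```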
- Alexey
- for first milestone, about 10-15 datasets totaling about 100 GB need to be used ➜ see action items
- analysis (histogram plotting)
- there are no nice plotting libraries in Scala or Java
- Cristina:
- use Python plotting libraries, anything except PyROOT
- Saba:
- SparkR
- Alexey
- Python
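Whichever plotting library is used, the filling step behind the first plot is just fixed-width binning; a dependency-free sketch (bin range and variable are illustrative):

```python
def fill_histogram(values, nbins, lo, hi):
    """Count values into nbins equal-width bins on [lo, hi);
    out-of-range values are dropped (no under/overflow bins).
    The resulting counts can be handed to e.g. matplotlib's bar()."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return counts
```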
- performance discussion
- the skimming part should be moved from Python to Scala as soon as possible
- hardware/facilities
- to scale up:
- data ingest needs to be more efficient: JimP is working on it
- we are starting to need a cluster; at 100 GB, working on a local instance on a laptop becomes difficult
- 100 GB also no longer fits into memory, so rapid iteration likewise needs a cluster
- concentrating on Princeton test cluster
- need access and 6 accounts, see action items
- Databricks academic
- free small test cluster
- link was posted in slack
- training
- introductory training course for Matteo, Cristina, OLI, etc.
- JimP is gathering some resources for Cristina
- Saba will send her favorite courses around on slack
- Amazon: AWS Monthly Webinar Series – March
- Learn how to extend the capabilities of your data warehouse with Hadoop and Spark and best practices for integrating these two technologies
- Learn how to build your data lake in the cloud with AWS and deliver a far more agile and flexible architecture to enable new types of analytical insights
- Sign-up link:
- publications
- CHEP abstract(s): due April 11th
- OLI will write first draft
- SC computing abstract: due July 1st
- for NERSC tests and more advanced computational investigations
- communication in slack
- do everything in public
- comments are very helpful
- Action items
- Cristina is going to calculate the amount of data we need to load for the first plot (signal region)
- various data and MC samples
- possibly 100 GB
- Aleksey
- can we use the Princeton test cluster for all our tests and development ➜ in principle yes
- need to figure out first access for JimP
- then we need 5 additional accounts
- OLI:
- create channels in slack: training channel, cluster channel (everything related to access and usage of facilities)
- OLI:
- start CHEP abstract
- Cristina and JimP
- setup github project for analysis code and other utilities
- most of JimP’s work that is general will live in DIANA github project
- Cristina, JimP and Matteo:
- decouple dictionaries and BACON libraries from CMSSW