- Reminder: first milestone
- load BACON ntuples into Hadoop+Spark and produce a plot by the beginning of April
- broke the analysis up into two pieces
- skimming
- getting the data out of ROOT (Jim)
- 1st version: JSON files with an Avro schema (see the sketch below)
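A minimal Python sketch of what this skim step could look like, assuming uproot/awkward; the file name, tree name, and branch names are placeholders, not the actual BACON layout:

```python
# Hypothetical skim: dump a few branches of a ROOT TTree to JSON-lines.
# "bacon.root", "Events", and the branch names are placeholders.
import json
import math

import awkward as ak
import uproot

def clean(obj):
    # Standard JSON has no NaN/Infinity; map them to null so downstream
    # parsers (Spark, Avro) do not choke -- see the problems reported below.
    if isinstance(obj, float) and (math.isnan(obj) or math.isinf(obj)):
        return None
    if isinstance(obj, list):
        return [clean(x) for x in obj]
    if isinstance(obj, dict):
        return {k: clean(v) for k, v in obj.items()}
    return obj

tree = uproot.open("bacon.root")["Events"]
with open("bacon.jsonl", "w") as out:
    # iterate() yields chunks, so the whole tree never sits in memory
    for chunk in tree.iterate(["Muon_pt", "Muon_eta", "Muon_phi"],
                              step_size=10_000):
        for event in ak.to_list(chunk):  # one dict per event
            out.write(json.dumps(clean(event)) + "\n")
```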
- loading data into Hadoop+Spark
- Alexey
- used the test instance at Princeton with the local file system
- converted the JSON and schema into an Avro file and loaded it into Hadoop (sketch below)
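For reference, a sketch of this JSON-to-Avro conversion using fastavro; the schema and file names are placeholders:

```python
# Convert a JSON-lines skim plus its Avro schema into an Avro container file.
import json

from fastavro import parse_schema, writer

with open("bacon.avsc") as f:        # the Avro schema shipped with the skim
    schema = parse_schema(json.load(f))

with open("bacon.jsonl") as src, open("bacon.avro", "wb") as dst:
    records = (json.loads(line) for line in src)  # stream, don't slurp
    writer(dst, schema, records)

# Then load the file into Hadoop, e.g.:
#   hdfs dfs -put bacon.avro /data/bacon.avro
```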
- Saba
- loaded the JSON files into HDFS on a local development cluster
- reported some problems: infinities, NaNs, and the parser complaining about braces between two records
- every line is a JSON record ➜ feed the parser line by line instead of the whole file content at once (see below)
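In Python terms, the fix looks like this (file name is a placeholder); calling the parser once on the whole file fails at the brace that opens the second record:

```python
import json

events = []
with open("bacon.jsonl") as f:
    for line in f:              # one JSON record per line
        line = line.strip()
        if line:                # skip blank lines
            events.append(json.loads(line))

# By contrast, json.load(f) on the whole file raises "Extra data"
# at the '{' that opens the second record.
```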
- Cristina and Jim
- standalone mode
- read the JSON file directly with SparkSQL DataFrames, without converting it to Avro
- converted the SparkSQL DataFrames into Python objects (very inefficient, but good enough to start)
- gave up on direct analysis in SparkSQL
- the BACON TTree is a complex structure, and SQL is inconvenient for handling it
- Python program to filter events ➜ the filtered events can then be plotted in Python (see the sketch after this list)
- for the first milestone, about 10-15 datasets totaling roughly 100 GB need to be used ➜ see action items
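A minimal PySpark sketch of the approach just described: read the JSON-lines skim directly (no Avro), apply a selection, and pull the surviving events back into the Python driver. The column name and cut value are placeholders, not the actual signal-region selection:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bacon-milestone").getOrCreate()

# Spark's JSON reader expects exactly the one-record-per-line layout above
df = spark.read.json("hdfs:///data/bacon.jsonl")

selected = df.filter(df.met > 200.0)  # placeholder cut

# Turning DataFrame rows back into driver-side Python objects is the
# "very inefficient, but good enough to start" step noted above
events = selected.collect()
met_values = [row.met for row in events]
```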
- analysis (histogram plotting)
- there are no nice plotting libraries in Scala or Java
- Cristina:
- use Python plotting libraries, everything except PyROOT (see the matplotlib sketch below)
- Saba:
- Alexey
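For the plotting side, a matplotlib sketch ("everything except PyROOT"); the values and binning are placeholders:

```python
import matplotlib.pyplot as plt

# e.g. the met_values list collected from Spark in the sketch above
met_values = [245.0, 310.5, 220.1, 405.2]

plt.hist(met_values, bins=50, range=(200.0, 1000.0), histtype="step")
plt.xlabel("MET [GeV]")
plt.ylabel("Events")
plt.savefig("met.png")
```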
- performance discussion
- the skimming part should be moved from Python to Scala as soon as possible
- hardware/facilities
- to scale up:
- data ingest needs to be more efficient: JimP is working on it
- we are starting to need a cluster; at 100 GB, working on a local instance on a laptop becomes difficult
- 100 GB also no longer fits into memory, so rapid iteration requires a cluster as well
- concentrating on Princeton test cluster
- need access and 6 accounts, see action items
- Databricks academic
- free small test cluster
- link was posted in slack
- training
- introductory training course for Matteo, Cristina, OLI, etc.
- JimP is gathering some resources for Cristina
- Saba will send her favorite courses around on slack
- Amazon: AWS Monthly Webinar Series – March
- Learn how to extend the capabilities of your data warehouse with Hadoop and Spark and best practices for integrating these two technologies
- Learn how to build your data lake in the cloud with AWS and deliver a far more agile and flexible architecture to enable new types of analytical insights
- Sign-up link:
- publications
- CHEP abstract(s): due April 11
- OLI will write first draft
- SC computing abstract: due July 1st
- for NERSC tests and more advanced computational investigations
- communication in slack
- do everything in public
- comments are very helpful
- Action items
- Cristina is going to calculate the amount of data we need to load for the first plot (signal region)
- various data and MC samples
- possibly 100 GB
- Alexey
- can we use the Princeton test cluster for all our tests and development? ➜ in principle yes
- need to figure out access for JimP first
- then we need 5 additional accounts
- OLI:
- create channels in slack: training channel, cluster channel (everything related to access and usage of facilities)
- Cristina and JimP
- set up a GitHub project for the analysis code and other utilities
- most of JimP's general-purpose work will live in the DIANA GitHub project
- Cristina, JimP and Matteo:
- decouple dictionaries and BACON libraries from CMSSW