• Reminder: first milestone
    • load BACON ntuples into Hadoop+Spark and produce a plot by the beginning of April

 

  • broke the analysis up into 2 pieces
    • skimming
      • getting the data out of ROOT (Jim)
        • 1st version: JSON files with an Avro schema
      • loading data into hadoop+spark
        • Aleksey
          • used the test instance at Princeton with its local file system
          • converted the JSON and schema into an Avro file, then loaded the Avro file into Hadoop
        • Saba
          • loaded the JSON files into HDFS on a local development cluster
          • reported some problems (infinities, NaNs, and a parser error complaining about braces between two records)
            • every line is a separate JSON object ➜ parse line by line instead of feeding the whole file content to the JSON parser
        • Cristina and Jim
          • stand alone mode
          • took the JSON file, without converting it to Avro, using Spark SQL DataFrames
          • converted the Spark SQL DataFrames into objects (very inefficient, but good to start)
            • gave up on direct Spark SQL analysis
            • the BACON TTree is a complex structure, and SQL is inconvenient for handling it
          • Python program to filter events ➜ the filtered events can then be plotted in Python
      • for the first milestone, about 10-15 datasets totaling about 100 GB need to be used ➜ see action items
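
The line-by-line JSON fix reported above can be sketched in Python. This is a minimal sketch, not the actual skimming code: the field names are made up, and the NaN/Infinity handling relies on CPython's json module accepting those non-standard tokens, while the sanitizer nulls them out for stricter downstream consumers.

```python
import json
import math

def iter_events(path):
    """Yield one event per line from a newline-delimited JSON file.

    Feeding the whole file to json.loads fails ("braces between two
    records") because consecutive records are not wrapped in a JSON
    array; each line on its own, however, is a valid JSON object.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)

def sanitize(obj):
    """Recursively replace non-finite floats (inf, -inf, NaN) with None
    so that strict JSON or Avro consumers downstream do not choke."""
    if isinstance(obj, float) and not math.isfinite(obj):
        return None
    if isinstance(obj, dict):
        return {k: sanitize(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [sanitize(v) for v in obj]
    return obj
```
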
    • analysis (histogram plotting)
      • there are no nice plotting libraries in Scala or Java
      • Cristina: 
        • use Python plotting libraries: anything except PyROOT
      • Saba:
        • SparkR
      • Aleksey
        • Python
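
Whichever plotting route wins, the histogram filling itself needs no plotting library. A minimal pure-Python sketch (the bin count, range, and pT values are illustrative assumptions, not taken from the analysis):

```python
def fill_hist(values, nbins, lo, hi):
    """Fill a fixed-width-bin histogram, tracking underflow and
    overflow the way a ROOT TH1 does; returns (counts, under, over)."""
    counts = [0] * nbins
    under = over = 0
    width = (hi - lo) / nbins
    for v in values:
        if v < lo:
            under += 1
        elif v >= hi:
            over += 1
        else:
            counts[int((v - lo) / width)] += 1
    return counts, under, over

# hypothetical filtered muon pT values in GeV, four 50 GeV bins
counts, under, over = fill_hist([5.0, 30.0, 55.0, 250.0], nbins=4, lo=0.0, hi=200.0)
# counts == [2, 1, 0, 0], one overflow at 250 GeV
```

The resulting counts can then be handed to whichever frontend is chosen (a Python plotting library, or SparkR), keeping the filling logic independent of the plotting decision.
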
    • performance discussion
      • the skimming part should be moved from Python to Scala as soon as possible

 

  • hardware/facilities
    • to scale up:
      • data ingest needs to be more efficient: JimP is working on it
      • we start needing a cluster: at 100 GB, working on a local instance on your laptop becomes difficult
      • 100 GB also no longer fits into memory; rapid iteration likewise needs a cluster
    • concentrating on the Princeton test cluster
      • need access and 6 accounts, see action items
    • Databricks academic
      • free small test cluster
      • link was posted in Slack

 

  • training
    • introductory training course for Matteo, Cristina, OLI, etc.
      • JimP is gathering some resources for Cristina
      • Saba will send her favorite courses around on Slack
    • Amazon: AWS Monthly Webinar Series – March
      • Learn how to extend the capabilities of your data warehouse with Hadoop and Spark and best practices for integrating these two technologies
        • https://publish.awswebcasts.com/content/connect/c1/7/en/events/event/shared/30108902/event_landing.html?connect-session=graysonbreezkvsuk4diqm4gw3n4&sco-id=30066676&campaign-id=emc_11928&_charset_=utf-8
      • Learn how to build your data lake in the cloud with AWS and deliver a far more agile and flexible architecture to enable new types of analytical insights
        • https://publish.awswebcasts.com/content/connect/c1/7/en/events/event/shared/30108902/event_landing.html?connect-session=graysonbreezkvsuk4diqm4gw3n4&sco-id=30066746&campaign-id=emc_11928&_charset_=utf-8
      • Sign-up link:
        • http://aws.amazon.com/about-aws/events/monthlywebinarseries/?sc_channel=em&sc_campaign=eventreg_webinar_mwebinar-marchmonthlywebinar2016&sc_publisher=aws&sc_content=webinar&sc_country=mult&sc_geo=global&sc_category=mult&campaign-id=emc_11928&trkcampaign=WEBINARSERIES&trk=em_march2016_series&mkt_tok=3RkMMJWWfF9wsRonuqXLeu%2FhmjTEU5z17uwpXa6%2BlMI%2F0ER3fOvrPUfGjI4JSMNrI%2BSLDwEYGJlv6SgFS7HHMbR617gKXRc%3D

 

  • publications
    • CHEP abstract(s): due April 11th
      • OLI will write first draft
    • SC computing abstract: due July 1st
      • for NERSC tests and more advanced computational investigations

 

  • communication in Slack
    • do everything in public
    • comments are very helpful

 

  • Action items
    • Cristina is going to calculate the amount of data we need to load for the first plot (signal region)
      • various data and MC samples
      • possibly 100 GB
    • Aleksey
      • can we use the Princeton test cluster for all our tests and development? ➜ in principle yes
      • first need to figure out access for JimP
      • then we need 5 additional accounts
    • OLI:
      • create channels in Slack: a training channel and a cluster channel (everything related to access and usage of facilities)
    • OLI:
      • start CHEP abstract
    • Cristina and JimP
      • set up a GitHub project for analysis code and other utilities
      • most of JimP's general-purpose work will live in the DIANA GitHub project
    • Cristina, JimP and Matteo:
      • decouple dictionaries and BACON libraries from CMSSW