CMS Big Data Science Project
US/Central
Description
DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East
CERN room:
Instructions to create a light-weight CERN account to join the meeting via Vidyo:
If not possible, people can join the meeting by phone, call-in numbers are here:
The meeting id is hidden below the Videoconference Rooms link, but here it is again:
- 10502145
- Reminder: first milestone
- load BACON ntuples in hadoop+spark and produce a plot by the beginning of April
- broke the analysis up into 2 pieces
- skimming
- getting the data out of root (Jim)
- 1st version: json files with avro schema
- loading data into hadoop+spark
- Alexey
- used test instance at Princeton, used local file system
- convert json and schema into avro file, loaded file into hadoop
- Saba
- loaded into HDFS on local development cluster using json files
- reported some problems (infinities, NaNs; also the parser complaining about braces between two records)
- every line is json ➜ load line by line instead of the whole file content into json parser
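The line-by-line fix can be sketched with the standard library alone (the field names and the finiteness filtering are assumptions for illustration):

```python
import json
import math

def read_events(path):
    """Read a skim file where each line is one JSON record.

    Feeding the whole file to json.loads fails (the parser complains
    about braces between two records) because the file is a sequence
    of JSON objects, not a single JSON value; parse it line by line.
    """
    events = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # json.loads accepts the non-standard tokens NaN/Infinity;
            # drop records containing them so downstream tools don't choke.
            event = json.loads(line)
            if any(isinstance(v, float) and not math.isfinite(v)
                   for v in event.values()):
                continue
            events.append(event)
    return events
```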
- Cristina and Jim
- stand alone mode
- took json file, without converting it into avro, using SparkSQL data frames
- convert SparkSQL data frames into objects (very inefficient, but good to start)
- gave up on SparkSQL direct analysis
- the BACON TTree is a complex structure, and SQL is inconvenient for handling it
- python program to filter events ➜ filtered events then can be plotted in python
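The filter step they ended up with can be sketched in plain Python; the field name and threshold here are placeholders, not the actual BACON selection:

```python
def skim(events, min_pt=30.0):
    """Keep events passing a simple selection.

    `events` is a list of dicts, e.g. rows pulled out of a SparkSQL
    DataFrame with df.collect(); the pt field and the 30 GeV cut are
    assumptions standing in for the real selection.
    """
    return [e for e in events if e.get("pt", 0.0) > min_pt]
```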
- Alexey
- for first milestone, about 10-15 datasets totaling about 100 GB need to be used ➜ see action items
- analysis (histogram plotting)
- there are no nice plotting libraries in Scala or Java
- Cristina:
- use Python plotting libraries, anything except PyROOT
- Saba:
- SparkR
- Alexey
- Python
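Whichever plotting library is used, the filling step behind the first plot is just fixed-width binning; a dependency-free sketch (bin range and variable are illustrative):

```python
def fill_histogram(values, nbins, lo, hi):
    """Count values into nbins equal-width bins on [lo, hi);
    out-of-range values are dropped (no under/overflow bins).
    The resulting counts can be handed to e.g. matplotlib's bar()."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return counts
```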
- performance discussion
- the skimming part should be moved from Python to Scala as soon as possible
- hardware/facilities
- to scale up:
- data ingest needs to be more efficient: JimP is working on it
- we are starting to need a cluster; at 100 GB, working on a local instance on a laptop becomes difficult
- 100 GB also no longer fits into memory, so rapid iteration likewise needs a cluster
- concentrating on Princeton test cluster
- need access and 6 accounts, see action items
- Databricks academic
- free small test cluster
- link was posted in slack
- training
- introductory training course for Matteo, Cristina, OLI, etc.
- JimP is gathering some resources for Cristina
- Saba will send her favorite courses around on slack
- Amazon: AWS Monthly Webinar Series – March
- Learn how to extend the capabilities of your data warehouse with Hadoop and Spark and best practices for integrating these two technologies
- Learn how to build your data lake in the cloud with AWS and deliver a far more agile and flexible architecture to enable new types of analytical insights
- Sign-up link:
- publications
- CHEP abstract(s): due April 11th
- OLI will write first draft
- SC computing abstract: due July 1st
- for NERSC tests and more advanced computational investigations
- communication in slack
- do everything in public
- comments are very helpful
- Action items
- Cristina is going to calculate the amount of data we need to load for the first plot (signal region)
- various data and MC samples
- possibly 100 GB
- Aleksey
- can we use the Princeton test cluster for all our tests and development ➜ in principle yes
- need to figure out first access for JimP
- then we need 5 additional accounts
- OLI:
- create channels in slack: training channel, cluster channel (everything related to access and usage of facilities)
- OLI:
- start CHEP abstract
- Cristina and JimP
- setup github project for analysis code and other utilities
- most of JimP’s work that is general will live in DIANA github project
- Cristina, JimP and Matteo:
- decouple dictionaries and BACON libraries from CMSSW