CMS Big Data Science Project

Description

DIR/Snake Pit - WH2NE - Wilson Hall, 2nd floor, North East

CERN room:

 

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If that is not possible, people can join the meeting by phone; call-in numbers are here:

The meeting ID is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

  • Reminder: first milestone
    • load BACON ntuples into Hadoop+Spark and produce a plot by the beginning of April

 

  • broke the analysis up into two pieces
    • skimming
      • getting the data out of ROOT (Jim)
        • 1st version: JSON files with an Avro schema
      • loading data into Hadoop+Spark
        • Alexey
          • used the test instance at Princeton with the local file system
          • converted the JSON and schema into an Avro file, loaded the file into Hadoop
        • Saba
          • loaded the JSON files into HDFS on a local development cluster
          • reported some problems (infinities, NaNs, and the parser complaining about braces between two records)
            • each line is a JSON record ➜ feed the parser line by line instead of the whole file content (see the sketch after this list)
        • Cristina and Jim
          • standalone mode
          • took the JSON file, without converting it into Avro, using SparkSQL DataFrames
          • converted SparkSQL DataFrames into objects (very inefficient, but good to start)
            • gave up on direct SparkSQL analysis
            • the BACON TTree is a complex structure, and SQL is inconvenient for handling it
          • Python program to filter events ➜ filtered events can then be plotted in Python (see the PySpark sketch after this list)
      • for the first milestone, about 10-15 datasets totaling about 100 GB need to be used ➜ see action items
    • analysis (histogram plotting)
      • there are no nice plotting libraries in Scala and Java
      • Cristina:
        • use Python plotting libraries, everything except PyROOT
      • Saba:
        • SparkR
      • Alexey
        • Python
    • performance discussion
      • the skimming part should be moved from Python to Scala as soon as possible
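
A minimal sketch of the line-by-line JSON fix and the JSON-to-Avro conversion described above. This is illustrative only: the file names are hypothetical, and the choice of the fastavro library is an assumption (the minutes do not record which Avro tooling was used).

    import json
    from fastavro import parse_schema, writer

    # Hypothetical file names; the actual paths are not in the minutes.
    SCHEMA_FILE = "bacon.avsc"        # Avro schema shipped alongside the JSON dump
    INPUT_FILE = "bacon_events.json"  # one JSON record per line
    OUTPUT_FILE = "bacon_events.avro"

    with open(SCHEMA_FILE) as f:
        schema = parse_schema(json.load(f))

    def records(path):
        # The file as a whole is not one valid JSON document (no enclosing
        # brackets, no commas between records), which is what made a
        # whole-file parse complain about braces between two records.
        # Parsing one line at a time avoids that.
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)

    with open(OUTPUT_FILE, "wb") as out:
        writer(out, schema, records(INPUT_FILE))

Python's json module happens to tolerate the NaN/Infinity tokens that tripped up other parsers; depending on the downstream consumer, such values may still need to be mapped to sentinel values. The resulting Avro file can then be pushed into HDFS with hdfs dfs -put.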
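
Similarly, a sketch of the SparkSQL DataFrame route plus Python plotting. Everything here is illustrative: the branch names NMuon and Muon_pt are made-up stand-ins for actual BACON branches, and the snippet assumes Spark 2.x's SparkSession (with Spark 1.x one would go through SQLContext instead).

    from pyspark.sql import SparkSession
    import matplotlib
    matplotlib.use("Agg")  # headless plotting, e.g. on a cluster node
    import matplotlib.pyplot as plt

    spark = SparkSession.builder.appName("bacon-skim").getOrCreate()

    # spark.read.json expects one JSON record per line, matching the
    # skimming output described above.
    df = spark.read.json("bacon_events.json")

    # Filter events; NMuon and Muon_pt are hypothetical branch names.
    muons = df.filter(df["NMuon"] > 0).select("Muon_pt")

    # Collect the skimmed column to the driver and histogram it with
    # matplotlib rather than PyROOT.
    pts = [row["Muon_pt"] for row in muons.collect()]
    plt.hist(pts, bins=50)
    plt.xlabel("muon pT [GeV]")
    plt.ylabel("events")
    plt.savefig("muon_pt.png")

Collecting to the driver only works while the filtered sample stays small, which matches the "very inefficient, but good to start" caveat above.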

 

  • hardware/facilities
    • to scale up:
      • data ingest needs to be more efficient: JimP is working on it
      • we are starting to need a cluster; at 100 GB, working on a local instance on a laptop becomes difficult
      • 100 GB also no longer fits into memory, so rapid iterations need a cluster as well
    • concentrating on the Princeton test cluster
      • need access and 6 accounts, see action items
    • Databricks academic
      • free small test cluster
      • link was posted in Slack

 

 

  • publications
    • CHEP abstract(s): due April 11th
      • OLI will write the first draft
    • SC computing abstract: due July 1st
      • for NERSC tests and more advanced computational investigations

 

  • communication in Slack
    • do everything in public
    • comments are very helpful

 

  • Action items
    • Cristina will calculate the amount of data we need to load for the first plot (signal region)
      • various data and MC samples
      • possibly 100 GB
    • Alexey
      • can we use the Princeton test cluster for all our tests and development ➜ in principle yes
      • need to figure out access for JimP first
      • then we need 5 additional accounts
    • OLI:
      • create channels in Slack: a training channel and a cluster channel (everything related to access and usage of facilities)
      • start the CHEP abstract
    • Cristina and JimP
      • set up a GitHub project for analysis code and other utilities
      • most of JimP’s general-purpose work will live in the DIANA GitHub project
    • Cristina, JimP and Matteo:
      • decouple dictionaries and BACON libraries from CMSSW