Analytics WG meeting

Europe/Zurich
31/S-023 (CERN)

Attending:

Dirk, Christian, Luca, Evangelos, Rainer, Bernd, Manuel, Jerome, Domenico, Sebastian, Tony, Raul, Maria, Valentin, Daniele


Minutes:

    - The TWiki is up; everybody should have a look and add information about their data sources.
    - The experiments want to join our efforts. Today we had Tony from CMS; next week we will have a representative of ATLAS.
    - Next week we will have some talks about analysis on Hadoop; further talks are welcome.

CMS Analytics:

    - Dirk asked if the CMS data set would fit into the cluster. The data set should be ~100 GB, so this is not a problem.
    - The question of how to deal with confidential/private data was raised. Options:
        - Anonymise the data on the CMS side (strings to IDs) and share it
        - Keep the critical part of the data in a separate folder, with access restricted to CMS representatives (by Unix user/group)
        - Run a parallel installation for CMS, while still sharing experience with Hadoop
    - CMS will discuss and select the most appropriate way to go
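
As an illustration of the first option above (strings to IDs), a minimal sketch might look like the following. The function name, the sample values, and the idea of keeping the mapping table on the CMS side are assumptions made for this example, not something decided in the meeting.

```python
from typing import Dict, Iterable, List


def anonymise(values: Iterable[str], mapping: Dict[str, int]) -> List[int]:
    """Replace each string with an integer ID.

    The same string always gets the same ID, so joins and counts
    still work on the anonymised data. The mapping itself would stay
    with the data owner (assumption: it is never shared).
    """
    ids = []
    for value in values:
        if value not in mapping:
            # Assign the next unused ID, starting at 1.
            mapping[value] = len(mapping) + 1
        ids.append(mapping[value])
    return ids


# Hypothetical sample values, for illustration only.
users = ["alice", "bob", "alice"]
mapping: Dict[str, int] = {}
anon = anonymise(users, mapping)
```

Sequential IDs are used here rather than hashes because a plain hash of a short string can be reversed by guessing; whether that matters depends on what CMS considers critical.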

LanDB Data in Hadoop:

    - CS will add further locality information (basically a resolution of the "Room" attribute, for ease of use)

AWG Repository prototype:

    - The access tool should stick to the minimal requirements for now:
        - Show available data sets and schemata
        - Extract time periods
        - Simple selection on attributes (>, <, =, etc.)
        - "Typical" extraction formats

    - For advanced analysis on the Hadoop cluster:
        - Create technology overview with examples/tutorials
        - In the end, complex analysis has to be implemented by the analyst
        - ...but we should exchange experience within the group
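
The minimal requirements listed above for the access tool (extracting a time period and a simple selection on an attribute) could be sketched roughly as follows. The record layout, the field names "timestamp" and "size_bytes", and the function name are hypothetical; the prototype's actual interface is not described in these minutes.

```python
import operator
from datetime import datetime

# Comparison operators supported by the simple selection.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def extract(records, start, end, attribute=None, op=None, value=None):
    """Return records with start <= timestamp < end.

    If an attribute is given, additionally keep only records where
    `record[attribute] <op> value` holds.
    """
    selected = []
    for rec in records:
        if not (start <= rec["timestamp"] < end):
            continue
        if attribute is not None and not OPS[op](rec[attribute], value):
            continue
        selected.append(rec)
    return selected


# Hypothetical sample records, for illustration only.
records = [
    {"timestamp": datetime(2014, 1, 1, 12, 0), "size_bytes": 100},
    {"timestamp": datetime(2014, 1, 2, 12, 0), "size_bytes": 500},
    {"timestamp": datetime(2014, 1, 3, 12, 0), "size_bytes": 900},
]
selected = extract(
    records, datetime(2014, 1, 1), datetime(2014, 1, 3), "size_bytes", ">", 200
)
```

The time window is half-open, so consecutive extractions do not return the same record twice.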

Agenda:

    - 14:00 - 14:05  Minutes & News (5m)
      Speaker: Dirk Duellmann (CERN)
    - 14:05 - 14:25  Data management analytics - topics in CMS (20m)
      Speaker: Dr Tony Wildish (Princeton University (US))
    - 14:25 - 14:35  Next steps for additional data types (10m)
      Speaker: Pedro Andrade (CERN)
    - 14:35 - 14:45  LANDB snapshots in HDFS (10m)
      Speaker: Rainer Toebbicke (CERN)
    - 14:45 - 15:05  Repository prototype - status update (20m)
      Speaker: Luca Menichetti (CERN)