10–14 Oct 2016
San Francisco Marriott Marquis
America/Los_Angeles timezone

Index files for Belle II - very small skim containers

13 Oct 2016, 14:45
15m
GG C2 (San Francisco Marriott)


Oral Track 7: Middleware, Monitoring and Accounting

Speaker

Prof. Martin Sevior (University of Melbourne)

Description

The Belle II experiment will generate very large data samples. To reduce the time needed for data analyses, loose selection criteria will be used to create files rich in samples of particular interest for a specific data analysis (data skims). Even so, many of the resulting skims will be very large, particularly for highly inclusive analyses. The Belle II collaboration is investigating the use of "index files", where instead of recording the actual physics data, the skim records pointers to events of interest in the complete data sample. This reduces the size of the skim files by two orders of magnitude. While this approach was successfully employed by the Belle experiment, where the majority of the data analysis was performed on a cluster at a single center, index files are significantly more challenging in a distributed computing system such as the Belle II computing grid.

In our approach, each skim file records metadata identifying the original parent file, as well as the location of the event within the parent file. Since we employ the ROOT file container system, it is straightforward to read just the identified events of interest. For remote file access, we employ the XROOTD protocol both to select the identified events and to transfer them to the worker nodes used for data analysis.
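
The bookkeeping described above can be illustrated with a minimal Python sketch. It is not the Belle II implementation: the names `IndexEntry`, `group_by_parent` and `read_selected`, the file names, and the in-memory stand-in for parent files are all hypothetical. In a real setup, the assumed `open_parent` callable would open a ROOT file, possibly through an XROOTD URL, and read only the listed entries.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One pointer in an index file (hypothetical layout)."""
    parent_file: str  # name of the parent file holding the full event
    entry: int        # position of the event within that parent file


def group_by_parent(index):
    """Collect the wanted entry numbers per parent file, sorted and de-duplicated."""
    grouped = defaultdict(set)
    for pointer in index:
        grouped[pointer.parent_file].add(pointer.entry)
    return {parent: sorted(entries) for parent, entries in grouped.items()}


def read_selected(index, open_parent):
    """Yield (parent, entry, event) for the indexed events only.

    `open_parent` is an assumed callable returning an indexable sequence of
    events; each parent is opened once and only the listed entries are read.
    """
    for parent, entries in group_by_parent(index).items():
        events = open_parent(parent)
        for entry in entries:
            yield parent, entry, events[entry]


# Toy stand-in for the complete data sample: two "parent files" of five events.
PARENTS = {
    "parent_A.root": [f"A-event-{i}" for i in range(5)],
    "parent_B.root": [f"B-event-{i}" for i in range(5)],
}

# A skim expressed as pointers rather than copies of the event data.
SKIM_INDEX = [
    IndexEntry("parent_A.root", 3),
    IndexEntry("parent_B.root", 0),
    IndexEntry("parent_A.root", 1),
]

selected = list(read_selected(SKIM_INDEX, PARENTS.__getitem__))
```

Grouping the pointers by parent file means each parent is opened only once in this sketch, which is likely to matter most when the parent file is remote.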

This presentation will describe the details of the implementation of index files within the Belle II grid. We will also describe the results of tests in which the analyses were performed locally and tests in which data was transferred internationally, in particular from Europe to Australia. The configuration of the XROOTD services for these tests has a very large impact on the performance of the system and will be reported in the presentation.

Primary Keyword (Mandatory): Data processing workflows and frameworks/pipelines
Secondary Keyword (Optional): Computing middleware
Tertiary Keyword (Optional): Distributed workload management

Primary author

Prof. Martin Sevior (University of Melbourne)

Co-authors

Ms Chia-Ling Hsu (University of Melbourne)
Dr Hideki Miyake (High Energy Accelerator Research Organization, Tsukuba, Japan)
Prof. Ikuo Ueda (High Energy Accelerator Research Organization, Tsukuba, Japan)
Prof. Takanori Hara (High Energy Accelerator Research Organization, Tsukuba, Japan)
Prof. Thomas Kuhr (LMU, Munich, Germany)
Mr Tristan Bloomfield (University of Melbourne)
