Speaker
L. Tuura
(Northeastern University, Boston, MA, USA)
Description
Experiments frequently produce many small data files for reasons beyond their control, such as output
splitting into physics data streams, parallel processing on large farms, database technology incapable of
concurrent writes into a single file, and the constraints of running farms reliably. The resulting file sizes
are often far from ideal for network transfer and mass-storage performance. Provided that the time to analysis
does not deteriorate significantly, files arriving from a farm could easily be merged into larger logical
chunks, for example by physics stream and file type within a configurable time and size window.
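As a rough illustration of such a merge window, the following C++ sketch groups incoming files by (physics stream, file type) and hands a group over for merging once it exceeds a configurable size or age. The class, method and parameter names are hypothetical and are not taken from the actual CMS tools.

// Hypothetical sketch of a time-and-size merge window: files are grouped by
// (physics stream, file type) and a group is released for merging once it
// reaches a configurable size or age.  Names and limits are illustrative only.
#include <ctime>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct MergeGroup {
  std::vector<std::string> files;   // small input files waiting to be merged
  long long totalBytes = 0;         // accumulated size of the group
  std::time_t firstArrival = 0;     // arrival time of the oldest file
};

class MergeWindow {
public:
  MergeWindow(long long maxBytes, int maxSeconds)
    : maxBytes_(maxBytes), maxSeconds_(maxSeconds) {}

  // Register a newly arrived file; return the batch of files to merge if the
  // (stream, type) group has filled its size or time window, else nothing.
  std::vector<std::string> add(const std::string& stream,
                               const std::string& type,
                               const std::string& path,
                               long long bytes,
                               std::time_t now) {
    MergeGroup& g = groups_[std::make_pair(stream, type)];
    if (g.files.empty())
      g.firstArrival = now;
    g.files.push_back(path);
    g.totalBytes += bytes;

    if (g.totalBytes >= maxBytes_ || now - g.firstArrival >= maxSeconds_) {
      std::vector<std::string> ready;
      ready.swap(g.files);          // hand the batch to the merging tool
      g.totalBytes = 0;
      return ready;
    }
    return {};
  }

private:
  long long maxBytes_;
  int maxSeconds_;
  std::map<std::pair<std::string, std::string>, MergeGroup> groups_;
};

A real implementation would presumably also flush ageing groups on a timer rather than only when a new file arrives, so that quiet streams are not held back indefinitely.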
Uncompressed zip archives appear to be an attractive candidate for such file merging and are currently being
tested by the CMS experiment. We describe the main components now in use: the merging tools, tools to read and
write zip files directly from C++, plug-ins to the database system, mass-storage access optimisation,
consistent handling of application and replica metadata, and integration with catalogues and other grid tools.
We report on the file size ratios obtained in the CMS 2004 data challenge, on our observations and analysis of
the changes to data access, and on the estimated impact on network usage.
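To illustrate how files can be appended to an uncompressed zip archive directly from C++, the sketch below stores entries without compression using zlib's contrib minizip interface. The choice of minizip is an assumption made for illustration only and may differ from the library used by the CMS tools.

// Illustrative sketch only: append files to a zip archive as stored
// (uncompressed) entries via zlib's contrib "minizip" interface.
#include <cstdio>
#include <vector>
#include "zip.h"   // from zlib/contrib/minizip

// Add one file to an already opened archive without compressing it
// (method 0 = stored, compression level 0).
bool addStored(zipFile archive, const char* path, const char* entryName) {
  std::FILE* in = std::fopen(path, "rb");
  if (!in)
    return false;

  if (zipOpenNewFileInZip(archive, entryName, 0, 0, 0, 0, 0, 0,
                          0 /* stored */, 0 /* no compression */) != ZIP_OK) {
    std::fclose(in);
    return false;
  }

  std::vector<char> buffer(1 << 20);
  size_t n;
  bool ok = true;
  while ((n = std::fread(&buffer[0], 1, buffer.size(), in)) > 0)
    if (zipWriteInFileInZip(archive, &buffer[0], (unsigned) n) != ZIP_OK) {
      ok = false;
      break;
    }

  std::fclose(in);
  return zipCloseFileInZip(archive) == ZIP_OK && ok;
}

int main(int argc, char** argv) {
  if (argc < 3)
    return 1;
  zipFile archive = zipOpen(argv[1], APPEND_STATUS_CREATE);
  if (!archive)
    return 1;
  for (int i = 2; i < argc; ++i)
    addStored(archive, argv[i], argv[i]);
  zipClose(archive, 0);
  return 0;
}

Storing members with method 0 keeps each one byte-identical to the original file and directly seekable inside the archive, so the merged file acts as a logical container rather than a compressed blob.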
Authors
L. Tuura
(Northeastern University, Boston, MA, USA)
T. Barrass
(Bristol University, UK)
V. Innocente
(CERN, PH/SFT)