Speaker
K. Wu
(Lawrence Berkeley National Lab)
Description
Nuclear and high-energy physics experiments such as STAR at BNL
generate millions of files containing petabytes of data each year. In
most cases, an analysis program has to read every event in a file in
order to find the interesting ones.
Since most analyses are interested in only a subset of the events in
a number of files, a significant portion of the computing time is
wasted on reading unwanted events. To address this issue, we
developed a software system called the Grid Collector. The core of
the Grid Collector is an "Event Catalog".
This catalog can be efficiently searched with compressed bitmap
indices. Tests show that it can index and search STAR event data
much faster than database systems.
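To illustrate the general idea of a bitmap index, the following is a minimal sketch and not the Grid Collector's actual implementation. It keeps one bitmap per distinct attribute value and answers a multi-attribute selection with a bitwise AND. Python integers stand in for the bitmaps, compression is omitted entirely, and the attribute names and values (`trigger`, `n_tracks_bin`) are hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def rows_of(bitmap):
    """Return the row (event) numbers whose bits are set in the bitmap."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical per-event attributes; names and values are illustrative only.
trigger = ["minbias", "central", "minbias", "central", "central"]
n_tracks_bin = ["low", "high", "high", "high", "low"]

trigger_index = build_bitmap_index(trigger)
tracks_index = build_bitmap_index(n_tracks_bin)

# Events satisfying: trigger == "central" AND n_tracks_bin == "high"
selected = rows_of(trigger_index["central"] & tracks_index["high"])
print(selected)
```

Because each condition reduces to a bitwise operation on precomputed bitmaps, only the events whose bits survive the AND would need to be read from their files.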
The Grid Collector is fully integrated with an existing analysis
framework, so minimal effort is required to use it in an analysis
program. In addition, by taking advantage of existing file catalogs,
Storage Resource Managers (SRMs) and GridFTP, the Grid Collector
automatically downloads the needed files from anywhere on the Grid
without user intervention.
The Grid Collector can significantly improve user productivity. The
improvement grows as users narrow their searches toward rare events,
because only the selected events are read into memory and the
necessary files are automatically located and downloaded through the
best available route. For a user who typically performs computation
on 50% of the events, using the Grid Collector could cut the
turnaround time in half.
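The 50% figure can be read through a deliberately simple model, sketched below under the assumption that turnaround time scales linearly with the fraction of events actually read and processed. The function name and the linear model are illustrative assumptions, not measurements; index lookup and file transfer costs are ignored.

```python
def estimated_turnaround(full_scan_time, selected_fraction):
    """Estimate turnaround when only a fraction of the events is read.

    Assumes time is proportional to the number of events read, which is
    a simplification: index search and file transfer overheads are ignored.
    """
    if not 0.0 <= selected_fraction <= 1.0:
        raise ValueError("selected_fraction must be between 0 and 1")
    return full_scan_time * selected_fraction


# A job that takes 10 hours when reading every event, but needs only 50% of them.
print(estimated_turnaround(10.0, 0.5))
```

Under the same assumption, a rare-event selection that keeps a much smaller fraction of the events would see a proportionally larger reduction.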
Primary authors
A. Shoshani
(Lawrence Berkeley National Lab)
K. Wu
(Lawrence Berkeley National Lab)
V. Perevoztchikov
(Brookhaven National Lab)
W-M. Zhang
(Kent State University)