Speaker
T. Smith
(CERN)
Description
This paper discusses the challenges of maintaining a stable Managed Storage Service
for users on top of dynamic underlying disk and tape layers.
Early in 2004 the tools and techniques used to manage disk, tape, and stage servers
were refreshed by adopting the QUATTOR tool set. This has markedly increased the
coherency and efficiency of the configuration of data servers. The LEMON
monitoring suite was deployed to raise alarms and gather performance metrics.
Exploiting this foundation, higher level service displays are being added, giving
comprehensive and near-real-time views of operations. The scope of our monitoring
has been broadened to include low-level machine sensor data such as temperature, IPMI
and SMART readings, improving our ability to detect impending hardware failure.
In terms of operations, widespread disk reliability problems, which were
manpower-intensive to chase, were overcome by exchanging a bad batch of 1200 disks. Recent
LHC data challenges have ventured into new operating domains for the CASTOR system,
with massive disk-resident file catalogues requiring special handling. The tape
layer has focused on STK 9940 drives for bulk recording capacity: a large-scale
data migration to this medium permitted old drive technologies to be retired.
Repacking 9940A data to 9940B high-density media allows us to recycle tapes, giving
substantial savings by avoiding acquisition of new media.
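As a rough illustration of why repacking frees media, the sketch below estimates how many cartridges become available when full 9940A tapes are rewritten at 9940B density. The per-cartridge capacities (60 GB and 200 GB native) and the tape count are assumptions for illustration and are not figures from this abstract; compression and partially filled tapes are ignored.

```python
import math

# Assumed native cartridge capacities in GB (not stated in the abstract).
CAPACITY_9940A_GB = 60
CAPACITY_9940B_GB = 200


def cartridges_freed(full_9940a_tapes: int) -> int:
    """Cartridges left over after rewriting full 9940A tapes at 9940B density."""
    data_gb = full_9940a_tapes * CAPACITY_9940A_GB
    # Round up: a partially filled 9940B cartridge still occupies a cartridge.
    needed = math.ceil(data_gb / CAPACITY_9940B_GB)
    return full_9940a_tapes - needed


# Hypothetical example: 1000 full 9940A tapes.
print(cartridges_freed(1000))  # -> 700
```

Under these assumptions roughly 70% of the repacked cartridges could be reused for new data, which is where the saving on new media would come from.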
In addition to more robust software, hardware developments are required for LHC-era
services. We are moving from EIDE to SATA-based disk storage and envisage a tape
drive technology refresh. Details will be provided of our investigations in these
areas.
Primary authors
C. Curran
(CERN)
D. Hughes
(CERN)
G. Lee
(CERN)
H. Cacote
(CERN)
J. van Eldik
(CERN)
T. Osborne
(CERN)
T. Smith
(CERN)
V. Bahyl
(CERN)