T. Smith (CERN)
This paper discusses the challenges of maintaining a stable Managed Storage Service for users on top of dynamic underlying disk and tape layers. Early in 2004 the tools and techniques used to manage disk, tape, and stage servers were refreshed by adopting the QUATTOR tool set. This has markedly increased the coherency and efficiency of data server configuration. The LEMON monitoring suite was deployed to raise alarms and gather performance metrics. Building on this foundation, higher-level service displays are being added, giving comprehensive and near-real-time views of operations. The scope of our monitoring has been broadened to include low-level machine sensors such as thermometer, IPMI and SMART readings, improving our ability to detect impending hardware failure. In terms of operations, widespread disk reliability problems, which were manpower-intensive to chase, were overcome by exchanging a bad batch of 1200 disks. Recent LHC data challenges have taken the CASTOR system into new operating domains, with massive disk-resident file catalogues requiring special handling. The tape layer has focused on STK 9940 drives for bulk recording capacity: a large-scale data migration to these media permitted old drive technologies to be retired. Repacking 9940A data onto 9940B high-density media allows us to recycle tapes, giving substantial savings by avoiding the acquisition of new media. In addition to more robust software, hardware developments are required for LHC-era services. We are moving from EIDE to SATA-based disk storage and envisage a tape drive technology refresh. Details of our investigations in these areas will be provided.