Speaker
Dr Silke Halstenberg (Karlsruhe Institute of Technology)
Description
The dCache installation at GridKa, the German Tier-1 center, is ready for LHC data taking. After years of tuning and dry runs, several software and operational bottlenecks have been identified.
This contribution describes several procedures to improve the stability and reliability of the Tier-1 storage setup. These range from redundant hardware and disaster planning, through fine-grained monitoring and automatic fault recovery, to 24/7 on-call maintenance. With these measures, GridKa is expected to meet the required MoU targets with a minimum of administrator intervention.
Prior to updates, a mirror setup is used to test new releases and become familiar with them. The mirror setup is also used to replay scenarios in which problems have occurred. The role of the mirror system and its use are explained and evaluated. Error reports and trouble tickets are handled in an escalation procedure that involves operators, grid administrators and dCache experts. The workflow for solving tickets and fixing problems is described in detail. We also present an analysis and categorization of the trouble tickets handled during the last two years, which served to improve the stability and service of the data management systems.
Presentation type
oral
Author
Dr Silke Halstenberg (Karlsruhe Institute of Technology)
Co-authors
Dr Christopher Jung (Karlsruhe Institute of Technology)
Dr Doris Ressmann (Karlsruhe Institute of Technology)
Mrs Stephanie Boehringer (Karlsruhe Institute of Technology)