Speaker
Daniele Francesco Kruse
(CERN)
Description
Administering a large-scale, multi-protocol, hierarchical tape storage infrastructure like the one at CERN, which stores around 30 PB per year, requires an adequate monitoring system for quickly spotting malfunctions, easier debugging and on-demand report generation. The main challenges for such a system are coping with the diversity of log formats and with information scattered across several log files, the need for long-term information archival, strict data consistency requirements and group-based GUI visualization. For this purpose, we have designed, developed and deployed a centralized system consisting of four independent layers: a Log Transfer layer that collects log lines from all tape servers onto a single aggregation server, a Data Mining layer that combines log data into a transactional context, a Storage layer that archives the resulting transactions, and finally a Web UI layer for accessing the information. With flexibility, extensibility and maintainability in mind, each layer is designed to act as a message broker for the next layer, providing a clean and generic interface while ensuring consistency, redundancy and ultimately fault tolerance. The system unifies information previously dispersed over several monitoring tools into a single user interface built with Splunk, which also allows us to restrict information visualization through access control lists (ACLs). Since its deployment, it has been successfully used by CERN tape operators for a quick overview of transactions, performance evaluation and malfunction detection, and by managers for report generation. In this paper we present our design principles, the problems we encountered and their solutions, disaster cases and how we handle them, a comparison with other solutions, and future work.
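The abstract gives no implementation details, so the following Python sketch is purely illustrative: it shows how four independent stages can be chained so that each one acts as a message broker for the next, in the spirit of the layered design described above. All names (log_transfer, data_mining, the queue layout, the sample log lines) are hypothetical and are not taken from the paper or from the actual CERN system, which is built around Splunk and real tape-server logs.

# Illustrative sketch only: a minimal four-stage pipeline in which each layer
# consumes messages from the previous one and publishes to the next, mirroring
# the "each layer is a message broker for the next" idea from the abstract.
# All names and sample data are hypothetical, not from the paper.
import queue
import threading

SENTINEL = None  # marks the end of the stream for downstream stages


def log_transfer(out_q):
    """Collect raw log lines from tape servers and forward them (simulated here)."""
    raw_lines = [
        "tapeserver01 MOUNT vid=T12345",
        "tapeserver01 TRANSFER vid=T12345 bytes=1048576",
        "tapeserver01 UNMOUNT vid=T12345",
    ]
    for line in raw_lines:
        out_q.put(line)
    out_q.put(SENTINEL)


def data_mining(in_q, out_q):
    """Group related log lines into a transactional context keyed by volume id."""
    transactions = {}
    while (line := in_q.get()) is not SENTINEL:
        fields = dict(kv.split("=") for kv in line.split()[2:])
        vid = fields.get("vid", "unknown")
        transactions.setdefault(vid, []).append(line)
    for vid, lines in transactions.items():
        out_q.put({"vid": vid, "events": lines})
    out_q.put(SENTINEL)


def storage(in_q, out_q):
    """Archive each transaction, then pass it on to the presentation layer."""
    archive = []
    while (txn := in_q.get()) is not SENTINEL:
        archive.append(txn)  # stand-in for writing to a long-term store
        out_q.put(txn)
    out_q.put(SENTINEL)


def web_ui(in_q):
    """Stand-in for the Web UI layer: print what an operator would see."""
    while (txn := in_q.get()) is not SENTINEL:
        print(f"volume {txn['vid']}: {len(txn['events'])} events")


if __name__ == "__main__":
    q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
    stages = [
        threading.Thread(target=log_transfer, args=(q1,)),
        threading.Thread(target=data_mining, args=(q1, q2)),
        threading.Thread(target=storage, args=(q2, q3)),
        threading.Thread(target=web_ui, args=(q3,)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()

Because each stage only sees a queue interface, any layer could in principle be replaced or scaled independently, which is the flexibility and maintainability argument the abstract makes for the broker-per-layer design.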
Primary author
Fotios Nikolaidis
(University of Crete (GR))
Co-authors
Daniele Francesco Kruse
(CERN)
German Cancio Melia
(CERN)