10–15 Mar 2019
Steinmatte conference center
Europe/Zurich timezone

An innovative monitoring and maintenance model for the INFN CNAF Tier-1 data center infrastructure.

Not scheduled
20m
Steinmatte conference center

Hotel Allalin, Saas Fee, Switzerland https://allalin.ch/conference/
Poster
Track 1: Computing Technology for Physics Research
Poster Session

Speaker

Pier Paolo Ricci (INFN CNAF)

Description

Over the last few years we have renewed the Building Management System (BMS) software of our data center with the aim of improving its data collection capability. Given the complex physical distribution of the technical plants and the limits of the building that currently hosts our center, a system that simply monitors and collects all the necessary information and raises alarms only in case of major failures has proven unsatisfactory. In 2017 we suffered a major flood caused by the failure of a main water pipeline in the public street. Although this disastrous event was clearly far beyond our control, it forced us to completely reconsider the physical robustness of our site as well as the capabilities of the current monitoring and alarm system. It became clear that in some specific cases alerts should be triggered hours or days before the main problem actually arises, in order to allow efficient human intervention and a proper escalation process. This paradigm could easily be applied to almost all the infrastructure components at our site, mainly the electric power distribution and continuity systems and all the cooling devices. For this reason, in parallel with a substantial increase in the sensor density of our BMS data collection system, we have started a study of the applicability of a predictive maintenance approach to our site. Predictive maintenance techniques aim to prevent unexpected failures of infrastructure components, or other major events, by studying the whole set of collected monitoring data and building appropriate statistical models with the help of big data analysis and machine learning techniques. Improved monitoring of the power distribution units at our site and the introduction of a dedicated network of water leak sensors were the first steps towards increasing the amount of data at our disposal.
In addition, a high-definition closed-circuit television (CCTV) system with recording capability was introduced to improve remote surveillance of the data center and retrospective problem analysis. With sufficient monitoring statistics stored in our BMS, a preliminary, exploratory proof of concept for predictive data analysis could be constructed. This could lead to the model-building phase and to the creation of a prototype aimed at forecasting major infrastructure failures and forthcoming error conditions. The general idea is an approach to predictive maintenance in which scheduled corrective actions can be introduced to prevent potential failures in the near future and to increase the overall reliability of the site.
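
The idea of raising an alert hours or days before a limit is reached can be illustrated with a very small sketch. The abstract does not specify which statistical or machine learning models are used, so the example below is only an assumption: it fits a straight line to recent sensor readings and estimates how long it will take to cross an alarm threshold. The function names, the sample values, and the threshold are all hypothetical.

```python
from typing import Optional, Sequence


def hours_until_threshold(
    times_h: Sequence[float],
    values: Sequence[float],
    threshold: float,
) -> Optional[float]:
    """Estimate hours from the last sample until `threshold` is reached.

    Fits an ordinary least-squares line to (time, value) pairs and
    extrapolates it. Returns None when the fitted trend is flat or
    decreasing, i.e. the (upper) threshold is not being approached.
    This is an illustrative model only, not the one used at the site.
    """
    n = len(times_h)
    if n < 2 or n != len(values):
        raise ValueError("need at least two samples of equal length")

    mean_t = sum(times_h) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times_h)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")

    # Least-squares slope and intercept of value as a function of time.
    slope = sum(
        (t - mean_t) * (v - mean_v) for t, v in zip(times_h, values)
    ) / var_t
    intercept = mean_v - slope * mean_t

    if slope <= 0:
        return None

    t_cross = (threshold - intercept) / slope
    # Clamp to zero if the fitted line has already crossed the threshold.
    return max(0.0, t_cross - times_h[-1])


def should_alert(
    times_h: Sequence[float],
    values: Sequence[float],
    threshold: float,
    lead_time_h: float,
) -> bool:
    """Alert when the projected crossing falls within the lead time."""
    eta = hours_until_threshold(times_h, values, threshold)
    return eta is not None and eta <= lead_time_h


# Hypothetical hourly readings (e.g. a temperature in degrees Celsius).
sample_times = [0.0, 1.0, 2.0, 3.0]
sample_values = [20.0, 21.0, 22.0, 23.0]
eta_hours = hours_until_threshold(sample_times, sample_values, threshold=30.0)
```

With these made-up readings the value rises by one unit per hour, so a threshold of 30 is projected to be reached seven hours after the last sample. A call such as `should_alert(sample_times, sample_values, 30.0, lead_time_h=24.0)` would then trigger an early warning, while a flat series would not. A real deployment would need models that cope with noise, seasonality, and correlated sensors, which a linear fit does not.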

Primary author

Pier Paolo Ricci (INFN CNAF)
