9-13 July 2018
Sofia, Bulgaria
Europe/Sofia timezone

Disk failures in the EOS setup at CERN - A first systematic look at 1 year of collected data

9 Jul 2018, 15:00
Dirk Duellmann (CERN)


The EOS deployment at CERN is a core service used both for scientific data
processing and analysis and as a back-end for general end-user storage (e.g. home directories/CERNBOX).
The disk failure metrics collected over a period of 1 year from a deployment
of some 70k disks allow a first systematic analysis of the behaviour
of different hard disk types for the large CERN use-cases.

In this presentation we will describe the data collection and analysis,
summarise the measured failure rates and compare them with those of other large disk
deployments. In the second part of the presentation we will present a first
attempt to use the collected failure and SMART metrics to develop a machine
learning model that predicts imminent failures and hence helps avoid service degradation
and repair costs.
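
For readers unfamiliar with how disk failure rates are commonly compared across deployments, the sketch below computes an annualised failure rate (failures per disk-year of operation). This is a generic illustration, not necessarily the metric or method used in the talk; the function name and the example numbers are hypothetical and not taken from the CERN data.

```python
def annualised_failure_rate(failures: int, disk_days: float) -> float:
    """Return failures per disk-year of operation, as a fraction.

    disk_days is the total number of days all disks were in service,
    summed over the fleet (one disk running for 30 days adds 30).
    """
    if disk_days <= 0:
        raise ValueError("disk_days must be positive")
    disk_years = disk_days / 365.0
    return failures / disk_years


# Hypothetical numbers for illustration only (not from the talk).
example_afr = annualised_failure_rate(failures=10, disk_days=365_000)
print(f"Annualised failure rate: {example_afr:.2%}")
```

Normalising by disk-days rather than by disk count keeps the rate comparable when disks enter or leave service during the observation period.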

