1–4 Mar 2021
Europe/Zurich timezone

SRE fundamentals in EOS

1 Mar 2021, 17:10
20m
EOS

Speaker

Hugo Gonzalez Labrador (CERN)

Description

The EOS system is an advanced distributed storage system that deals with many extreme use cases (massive data injection from the LHC, latency-critical online home directories, and massive-throughput access from batch farms).

EOS implements many site reliability engineering best practices to support these use cases at scale and to support the operations team that maintains the production clusters.

In this presentation we explain some of the functionalities implemented in the core of EOS (logging, retry mechanism, QoS) that allow smooth operation of the service while accommodating the diverse use cases cited above.
