Description
Over the past 70 years, CERN’s pioneering work in particle physics and more than a decade of operations at the Large Hadron Collider (LHC) have driven a dramatic transformation in data storage. With each new experimental run, the scale and complexity of data handling continue to grow. As we approach the next Long Shutdown (LS3) and the High-Luminosity LHC (HL-LHC) era, storage infrastructure demands are expected to rise exponentially, bringing significant challenges and opportunities.
Today at CERN, we operate over 800 storage nodes across eight independent EOS instances, which form the backbone of data storage for experiments, services and users. Managing this infrastructure at the exabyte scale requires robust monitoring, smart alerting systems and a deep understanding of system performance and operational behavior.
In this talk, we will take a behind-the-scenes look at the daily operations of CERN’s storage systems, exploring what it takes to keep EOS running reliably under extreme conditions. We will highlight the evolution of our operational tools and practices and how we are preparing for future requirements in scalability, performance and reliability. Key topics will include improvements in observability, automation, fault detection and incident response, all of which are essential to supporting EOS as it scales to meet the demands of HL-LHC data workflows.