Oct 14 – 18, 2013
Amsterdam, Beurs van Berlage
Europe/Amsterdam timezone

Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs

Oct 14, 2013, 3:00 PM
Grote zaal (Amsterdam, Beurs van Berlage)

Grote zaal

Amsterdam, Beurs van Berlage

Poster presentation Facilities, Production Infrastructures, Networking and Collaborative Tools Poster presentations


Maria Dimou (CERN)


In the Wordwide LHC Computing Grid (WLCG) project the Tier centres are of paramount importance for storing and accessing experiment data and for running the batch jobs necessary for experiment production activities. Although Tier2 sites provide a significant fraction of the resources a non-availability of resources at the Tier0 or the Tier1s can seriously harm not only WLCG Operations but also the experiments' workflow and the storage of LHC data which are very expensive to reproduce. This is why availability requirements for these sites are high and committed in the WLCG Memorandum of Understanding (MoU). In this talk we describe the workflow of GGUS ALARMs, the only 24/7 mechanism available to LHC experiment experts for reporting to the Tier0 or the Tier1s problems with their Critical Services. Conclusions and experience gained from the detailed drills performed in each such ALARM for the last 4 years will be explained and the shift with time of Type of Problems met. The physical infrastructure put in place to achieve GGUS 24/7 availability will be summarised.

Primary author


Guenter Grein (KIT - Karlsruhe Institute of Technology (DE)) Helmut Dres (KIT - Karlsruhe Institute of Technology (DE)) Mr Oleg Dulov (KIT)

Presentation materials