14-18 October 2013
Amsterdam, Beurs van Berlage
Europe/Amsterdam timezone

Automating usability of ATLAS Distributed Computing resources

14 Oct 2013, 17:25
22m
Graanbeurszaal (Amsterdam, Beurs van Berlage)

Oral presentation to parallel session Distributed Processing and Data Handling A: Infrastructure, Sites, and Virtualization

Speaker

Dr Salvatore Tupputi (Universita e INFN (IT))

Description

Automating ATLAS Distributed Computing (ADC) operations is essential to reduce manpower costs and to enable performance-enhancing actions that improve the reliability of the system. A crucial case is the automatic exclusion and recovery of the storage resources at ATLAS computing sites, which are continuously exploited at the limit of their capabilities. Adopting unambiguous decision criteria is challenging because storage resources differ in type, size and role. The recently developed Storage Area Automatic Blacklisting (SAAB) tool provides a suitable solution by employing an inference algorithm that processes the site-by-site outcomes of SAM (Service Availability Monitoring) SRM tests. SAAB both provides global monitoring and performs automatic operations on single sites. Implementing SAAB was the first step in a comprehensive review of storage area monitoring and central management at all levels. This review involved reordering and optimizing the deployment of SAM tests and including SAAB results in the ATLAS Site Status Board, with dedicated metrics and views. The final structure allows the status of storage resources to be monitored with fine time granularity and automatic actions to be taken in foreseen cases, such as automatic exclusion/recovery and notifications to sites. Human actions are therefore restricted to tracking and exchanging tickets where and when needed. In this work we show the working principles and features of SAAB. We also present the decrease in human interactions achieved within the ATLAS Computing Operations team. Automation results in a prompt reaction to failures, which in turn optimizes resource exploitation.
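The abstract does not spell out SAAB's inference algorithm, so the following is only a minimal sketch of how an automatic exclusion/recovery decision over recent test outcomes might look. Everything in it is an assumption made for illustration: the `StorageArea` record, the `update` function, the six-test sliding window and both thresholds are hypothetical and are not taken from SAAB.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StorageArea:
    """Hypothetical record for one storage area at one site."""
    name: str
    blacklisted: bool = False
    # Sliding window of the most recent test outcomes (True = test passed).
    # The window length of 6 is an illustrative assumption.
    outcomes: deque = field(default_factory=lambda: deque(maxlen=6))


def update(area, passed, exclude_below=0.5, recover_above=0.9):
    """Record one test outcome and return the action to take, if any.

    Returns "exclude", "recover" or None. The two thresholds are
    illustrative assumptions, not SAAB's actual decision criteria.
    """
    area.outcomes.append(passed)
    if len(area.outcomes) < area.outcomes.maxlen:
        return None  # window not yet full: too little evidence to act
    success_rate = sum(area.outcomes) / len(area.outcomes)
    if not area.blacklisted and success_rate < exclude_below:
        area.blacklisted = True
        return "exclude"
    if area.blacklisted and success_rate > recover_above:
        area.blacklisted = False
        return "recover"
    return None
```

Using a stricter threshold for recovery than for exclusion is one way to keep a storage area from flapping between the two states when its test results hover around a single cut-off.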

Primary author

Dr Salvatore Tupputi (Universita e INFN (IT))

Co-authors

Alessandro Di Girolamo (CERN)
Jaroslava Schovancova (Brookhaven National Laboratory (US))
Tomas Kouba (Acad. of Sciences of the Czech Rep. (CZ))