ATLAS Slides
Report number ATL-SOFT-SLIDE-2018-392
Title Improving ATLAS computing resource utilization with HammerCloud
Author(s) Schovancova, Jaroslava (CERN Tier-0) ; Buehrer, Felix (Albert-Ludwigs-Universitaet Freiburg) ; Caballero-Bejar, Jose (Brookhaven National Laboratory (BNL)) ; Duckeck, Guenter (Fakultaet fuer Physik, Ludwig-Maximilians-Universitaet Muenchen) ; Fkiaras, Aristeidis (Athens University of Economics and Business (GR)) ; Legger, Federica (Fakultaet fuer Physik, Ludwig-Maximilians-Universitaet Muenchen) ; Maier, Thomas (Fakultaet fuer Physik, Ludwig-Maximilians-Universitaet Muenchen) ; Mancinelli, Valentina ; Sciacca, Francesco Giovanni (University of Bern, Albert Einstein Center for Fundamental Physics, Laboratory for High Energy Physics) ; Yusta Espla, Antonio (Fakultaet fuer Physik, Ludwig-Maximilians-Universitaet Muenchen)
Corporate author(s) The ATLAS collaboration
Collaboration ATLAS Collaboration
Submitted to 23rd International Conference on Computing in High Energy and Nuclear Physics, CHEP 2018, Sofia, Bulgaria, 9 - 13 Jul 2018
Submitted by jaroslava.schovancova@cern.ch on 21 Jun 2018
Subject category Particle Physics - Experiment
Accelerator/Facility, Experiment CERN LHC ; ATLAS
Free keywords atlas distributed computing ; testing ; commissioning ; hammercloud
Abstract HammerCloud is a framework to commission, test, and benchmark ATLAS computing resources and components of various distributed systems with realistic full-chain experiment workflows. HammerCloud contributes to ATLAS Distributed Computing (ADC) Operations and automation efforts by providing automated resource exclusion and recovery tools. These tools help refocus operational manpower on areas that have yet to be automated and improve the utilization of available computing resources. We present the recent evolution of the auto-exclusion/recovery tools: faster inclusion of new resources in the testing machinery, machine learning algorithms for anomaly detection, categorization of resources as master vs. slave for the purpose of blacklisting, and a tool for auto-exclusion/recovery of resources triggered by Event Service job failures, which is being extended to workflows other than the Event Service. We describe how HammerCloud helped commission various concepts and components of distributed systems: simplified configuration of queues for workflows of different activities (unified queues), components of Pilot (new movers), components of AGIS (controller), and the distributed data management system (protocols, direct data access, ObjectStore tests). We summarize the updates that brought HammerCloud up to date with developments in ADC and improved its flexibility to adapt to new activities and workflows, so that it can respond to the evolving needs of the ADC Operations team in a timely manner.



Record created 2018-06-21, last modified 2018-06-21