HammerCloud is a framework to commission, test, and benchmark ATLAS computing resources and components of various distributed systems with realistic full-chain experiment workflows. HammerCloud contributes to ATLAS Distributed Computing (ADC) Operations and automation efforts, providing the automated resource exclusion and recovery tools, that help re-focus operational manpower to areas which have yet to be automated, and improve utilization of available computing resources.
We present recent evolution of the auto-exclusion/recovery tools: faster inclusion of new resources in testing machinery, machine learning algorithms for anomaly detection, categorized resources as master vs. slave for the purpose of blacklisting, and a tool for auto-exclusion/recovery of resources triggered by Event Service job failures that is being extended to other workflows besides the Event Service.
We describe how HammerCloud helped commissioning various concepts and components of distributed systems: simplified configuration of queues for workflows of different activities (unified queues), components of Pilot (new movers), components of AGIS (controller), distributed data management system (protocols, direct data access, ObjectStore tests).
We summarize updates that brought HammerCloud up to date with developments in ADC and improved its flexibility to adapt to the new activities and workflows to respond to evolving needs of the ADC Operations team in a timely manner.