Description
HammerCloud (HC) is a framework for testing and benchmarking resources of the Worldwide LHC Computing Grid (WLCG). It tests the computing resources and the various components of distributed systems with workloads ranging from very simple functional tests to full-chain experiment workflows. This contribution concentrates on the ATLAS implementation, which makes extensive use of HC for monitoring global resources and which has additionally implemented a mechanism to automatically exclude resources if certain critical tests fail. The auto-exclusion mechanism saves resources by avoiding sending computationally intensive jobs to non-functioning clusters.
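The auto-exclusion idea can be illustrated with a toy sketch. This is not HammerCloud's actual implementation; the class name, the threshold value, and the rule "exclude after N consecutive critical test failures" are all hypothetical assumptions chosen for illustration:

```python
from dataclasses import dataclass

# Hypothetical threshold: consecutive critical-test failures before exclusion.
CRITICAL_FAILURE_THRESHOLD = 3

@dataclass
class SiteStatus:
    name: str
    recent_results: list  # chronological; True = test job succeeded

def should_exclude(site: SiteStatus,
                   threshold: int = CRITICAL_FAILURE_THRESHOLD) -> bool:
    """Exclude a site once its last `threshold` critical test jobs all failed."""
    tail = site.recent_results[-threshold:]
    return len(tail) == threshold and not any(tail)

# A site whose last three critical tests failed would be excluded:
print(should_exclude(SiteStatus("EXAMPLE_SITE", [True, False, False, False])))  # → True
```

In a real system the decision would also weigh which test failed and how recently, but the sketch captures the core trade-off: production jobs stop flowing to a site only after repeated evidence that it is broken.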
However, in some cases errors in central components of the distributed computing system lead to mass exclusions of otherwise well-functioning resources. A new feature improves recovery after such mass-exclusion events. For the auto-exclusion mechanism to be effective and to save resources, test jobs need to be sent at a sufficient frequency, which in turn consumes resources itself. In this contribution, we give an estimate of the overall resource balance of the auto-exclusion system and explore possible optimisations.
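The resource balance mentioned above amounts to comparing the cost of the test jobs against the cost of the failed production jobs they prevent. A minimal sketch of such an estimate, with purely illustrative numbers (none of the parameter values come from the actual ATLAS deployment):

```python
def net_savings(n_sites: int,
                tests_per_site_per_hour: float,
                test_cost_cpu_h: float,
                failed_job_cost_cpu_h: float,
                failures_avoided_per_hour: float) -> float:
    """Net CPU-hours saved per hour of operation:
    wasted production work avoided minus the cost of the test jobs."""
    testing_cost = n_sites * tests_per_site_per_hour * test_cost_cpu_h
    avoided_waste = failures_avoided_per_hour * failed_job_cost_cpu_h
    return avoided_waste - testing_cost

# Illustrative values only: 150 sites tested twice per hour at 0.1 CPU-h per
# test, preventing 50 failed production jobs per hour at 4 CPU-h each.
print(net_savings(150, 2, 0.1, 4.0, 50))  # → 170.0
```

The estimate makes the optimisation question concrete: lowering the test frequency reduces the testing cost but delays exclusion, which increases the number of failed production jobs that slip through.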
Individual services and scripts have been reorganised as part of a general overhaul including containerisation, and the web interface has been given a facelift after more than 10 years of operation. This contribution summarises the work needed to get HC ready for the next decade.