25–29 May 2026
Chulalongkorn University
Asia/Bangkok timezone

A Lightweight Engine for Automated WLCG Resource Inclusion and Exclusion in ATLAS

27 May 2026, 15:15
1h
Chulalongkorn University

Poster Presentation, Track 4 - Distributed computing, Poster

Speaker

Dr Brij Kishor Jashal (Rutherford Appleton Laboratory)

Description

From Probes to Policy: Harmonising ATLAS Resource Health Signals

The operational status of WLCG resources in ATLAS is determined through several parallel mechanisms: probe results and declared downtimes (Switcher), persistent failures in functional or performance tests (HammerCloud), and data transfer or storage exclusion conditions managed by distributed data management (DDM). ATLAS Distributed Computing Operations use this information to decide whether a resource is suitable for production, should be quarantined, or requires exclusion. We present a lightweight decision engine that consolidates these signals into a single canonical state (production, degraded, excluded, or pending). The engine ingests probe outputs, HammerCloud job history, administrative downtime information, and DDM exclusion indicators. It produces an auditable decision trace using pluggable adapters, a heuristic scoring system, and optional last-N HammerCloud policy emulation. This unified approach removes ambiguity caused by divergent decision paths, supports quicker recovery from transient issues, and allows dry-run deployment prior to replacing existing daemon logic. The design is extensible: additional WLCG-wide probes such as FTS transfer health, Rucio availability, or storage endpoint reliability can be incorporated through new adapters. The deterministic evaluation process may also be adopted by other LHC experiments to support consistent and explainable resource governance across varied monitoring sources.
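The consolidation step described above can be illustrated with a minimal sketch. It is not the engine presented in this contribution: the adapter names, the input dictionary keys, the equal-weight mean, the thresholds (0.8 and 0.4), and the rule that `last_n` consecutive HammerCloud failures force exclusion are all assumptions made for illustration. It only shows how pluggable adapters, a heuristic score, and a decision trace could fit together.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple


class State(Enum):
    PRODUCTION = "production"
    DEGRADED = "degraded"
    EXCLUDED = "excluded"
    PENDING = "pending"


@dataclass
class Signal:
    source: str
    score: float              # 0.0 (unhealthy) to 1.0 (healthy)
    hard_exclude: bool = False
    note: str = ""


# An adapter maps a raw per-resource snapshot to a Signal, or None if it has no data.
Adapter = Callable[[dict], Optional[Signal]]


def probe_adapter(raw: dict) -> Optional[Signal]:
    results = raw.get("probes")          # hypothetical key: list of booleans
    if not results:
        return None
    return Signal("probes", sum(results) / len(results), note=f"{sum(results)}/{len(results)} ok")


def downtime_adapter(raw: dict) -> Optional[Signal]:
    flag = raw.get("in_downtime")        # hypothetical key: bool
    if flag is None:
        return None
    return Signal("downtime", 0.0 if flag else 1.0, hard_exclude=flag, note=f"in_downtime={flag}")


def hammercloud_adapter(raw: dict, last_n: int = 3) -> Optional[Signal]:
    jobs = raw.get("hc_jobs")            # hypothetical key: list of "ok"/"failed", oldest first
    if not jobs:
        return None
    recent = jobs[-last_n:]
    ok = sum(1 for j in recent if j == "ok")
    # Assumed last-N policy: exclude when the last N test jobs all failed.
    all_failed = len(recent) == last_n and ok == 0
    return Signal("hammercloud", ok / len(recent), hard_exclude=all_failed,
                  note=f"{ok}/{len(recent)} of last jobs ok")


def ddm_adapter(raw: dict) -> Optional[Signal]:
    flag = raw.get("ddm_excluded")       # hypothetical key: bool
    if flag is None:
        return None
    return Signal("ddm", 0.0 if flag else 1.0, hard_exclude=flag, note=f"ddm_excluded={flag}")


DEFAULT_ADAPTERS: List[Adapter] = [probe_adapter, downtime_adapter, hammercloud_adapter, ddm_adapter]


def evaluate(raw: dict,
             adapters: Optional[List[Adapter]] = None,
             weights: Optional[Dict[str, float]] = None,
             healthy: float = 0.8,
             usable: float = 0.4) -> Tuple[State, List[str]]:
    """Return the canonical state and a human-readable decision trace."""
    adapters = DEFAULT_ADAPTERS if adapters is None else adapters
    weights = weights or {}
    trace: List[str] = []
    signals: List[Signal] = []
    for adapter in adapters:
        sig = adapter(raw)
        if sig is None:
            trace.append(f"{adapter.__name__}: no data")
            continue
        signals.append(sig)
        trace.append(f"{sig.source}: score={sig.score:.2f} hard_exclude={sig.hard_exclude} ({sig.note})")
    if not signals:
        trace.append("decision: pending (no signals available)")
        return State.PENDING, trace
    vetoes = [s.source for s in signals if s.hard_exclude]
    if vetoes:
        trace.append(f"decision: excluded (hard exclusion from {', '.join(vetoes)})")
        return State.EXCLUDED, trace
    total_w = sum(weights.get(s.source, 1.0) for s in signals)
    score = sum(weights.get(s.source, 1.0) * s.score for s in signals) / total_w
    if score >= healthy:
        state = State.PRODUCTION
    elif score >= usable:
        state = State.DEGRADED
    else:
        state = State.EXCLUDED
    trace.append(f"decision: {state.value} (weighted score {score:.2f})")
    return state, trace
```

Because `evaluate` has no side effects and is deterministic for a given input, a function of this shape could be run in a dry-run mode alongside existing logic, with its trace compared against the decisions actually taken. A new source such as FTS transfer health would be added as one more adapter function in the list.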

Author

Dr Brij Kishor Jashal (Rutherford Appleton Laboratory)
