Managing job-slot allocation in a multi-VO environment remains a persistent operational challenge for WLCG sites, particularly when each Virtual Organization (VO) employs distinct workload-management and scheduling behaviors. At the RAL Tier-1 (RAL-LCG2), more than a dozen VOs—including CMS, ATLAS, LHCb, and several smaller communities—compete for heterogeneous resources while relying on subtly different submission patterns, priority models, and pilot-job strategies. Ensuring that each VO receives its guaranteed share, while simultaneously enabling opportunistic exploitation of unused capacity, requires a balancing strategy that extends beyond static fair-share configurations.
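To make concrete what "beyond static fair-share configurations" can mean in an HTCondor pool, the fragment below sketches hierarchical accounting-group quotas with surplus sharing. The group names and quota fractions are purely illustrative assumptions, not RAL-LCG2's production values:

```
# Hypothetical accounting groups for three VOs (names and fractions
# are illustrative, not RAL-LCG2's actual configuration).
GROUP_NAMES = group_atlas, group_cms, group_lhcb

# Pledged fair-share fractions of the pool (dynamic quotas, summing to <= 1).
GROUP_QUOTA_DYNAMIC_group_atlas = 0.45
GROUP_QUOTA_DYNAMIC_group_cms   = 0.35
GROUP_QUOTA_DYNAMIC_group_lhcb  = 0.20

# Allow a group to exceed its quota when other groups leave capacity idle:
# this is what turns a static split into opportunistic backfill.
GROUP_ACCEPT_SURPLUS = TRUE

# Let jobs that exhaust their own group's quota renegotiate against
# unclaimed capacity.
GROUP_AUTOREGROUP = TRUE
```

With `GROUP_ACCEPT_SURPLUS` disabled this reduces to a purely static partition, which is exactly the configuration the abstract argues is insufficient when VO demand fluctuates.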
This contribution examines the complexities associated with harmonising these competing requirements, focusing on the interactions between VO-specific schedulers and site-level controls. We discuss the divergent workload characteristics of the major LHC VOs—such as CMS’s high pilot turnover, ATLAS’s multi-queue depth behaviour, and LHCb’s latency-sensitive submission logic—and how these differences influence both resource contention and backfill opportunities.
We present operational experience from the RAL Tier-1 in tuning HTCondor and ARC-CE parameters to shape multi-VO throughput, including dynamic slot-partitioning strategies, negotiator policy refinements, and queue-level throttling designed to preserve fairness while maximising utilisation. Results illustrate that achieving balanced allocation is non-trivial: naïve configurations can lead to persistent starvation, inefficient backfill, or pathological pilot cycling. The study demonstrates that sustained high efficiency in a multi-VO environment requires continuous calibration of site-level scheduling knobs, coordinated with VO-specific workload characteristics, to deliver both fair-share compliance and robust opportunistic use of spare capacity.
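As a sketch of the kinds of site-level knobs referred to above, the fragment below combines partitionable slots (for dynamic slot-partitioning), a negotiator group-sort expression biased toward under-served groups (to counter starvation), and a concurrency limit as a simple queue-level throttle. All names and numbers are hypothetical examples, not the tuned RAL values discussed in the contribution:

```
# Partitionable slots: one slot per worker node, carved dynamically into
# per-job slots so VOs with different core/memory footprints pack well.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%, mem=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE

# Negotiate first for the group furthest below its quota, so a VO whose
# pilots cycle quickly cannot starve slower-submitting communities.
# (Simplified form of HTCondor's default usage-over-quota ordering.)
GROUP_SORT_EXPR = ifThenElse(GroupQuota > 0, \
                  GroupResourcesInUse / GroupQuota, 3.4e+38)

# Queue-level throttle via a concurrency limit: at most 2000 matched jobs
# may hold the (hypothetical) "cms_pilot" token at once. Pilot jobs opt in
# with `concurrency_limits = cms_pilot` in their submit description.
CMS_PILOT_LIMIT = 2000
```

The interaction of these three mechanisms is where the tuning effort lies: a throttle set too low recreates the starvation it was meant to prevent, while surplus sharing without a sort expression that favours under-quota groups can reward the most aggressive pilot factory.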