Description
Because its hardware was procured in several phases, the computing infrastructure at the IHEP site is highly heterogeneous: the cluster contains multiple node models with varying capabilities, and the performance gap between nodes can be substantial. Traditional scheduling policies do not tightly couple hardware performance characteristics with job behavior, which can lead to suboptimal placements. For example, I/O-intensive jobs may occupy fast CPU nodes while CPU-sensitive jobs are dispatched to slower nodes. This mismatch wastes scarce computing resources that better placement would put to use.
To address this issue, we systematically inventory the site's hardware resources and annotate each compute node, via HTCondor ClassAds, with capability metrics covering compute, storage, and network. Users can then declare required capability thresholds when requesting resources. In parallel, we collect and classify jobs at large scale across the cluster. Job types whose resource demand patterns are well understood are preferentially scheduled onto the most suitable nodes, which enables precise job-to-node matching and improves overall resource utilization and cluster throughput.
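The matching idea can be illustrated with a minimal Python sketch. It is not the site's implementation: the score fields (`cpu_score`, `io_score`, `net_score`), the job classes, the weights, and the node names are all hypothetical, chosen only to show how user-declared thresholds (a hard filter) and a class-based preference (a ranking) could combine. In HTCondor terms these two steps correspond roughly to a requirements expression and a rank expression evaluated against the node's ClassAd.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Node:
    """A compute node with hypothetical capability scores (higher is better)."""
    name: str
    cpu_score: float
    io_score: float
    net_score: float


@dataclass(frozen=True)
class Job:
    """A job with a hypothetical behavior class and minimum capability thresholds."""
    name: str
    kind: str  # e.g. "cpu", "io", "net"; anything else is treated as unclassified
    min_cpu: float = 0.0
    min_io: float = 0.0
    min_net: float = 0.0


# Assumed weights per job class: which capability matters for which kind of job.
WEIGHTS = {
    "cpu": (1.0, 0.0, 0.0),
    "io": (0.0, 1.0, 0.0),
    "net": (0.0, 0.0, 1.0),
}
UNCLASSIFIED = (1 / 3, 1 / 3, 1 / 3)


def eligible(job: Job, node: Node) -> bool:
    """Hard filter: the node must meet every threshold the user declared."""
    return (
        node.cpu_score >= job.min_cpu
        and node.io_score >= job.min_io
        and node.net_score >= job.min_net
    )


def rank(job: Job, node: Node) -> float:
    """Soft preference: weight node capabilities by the job's class."""
    w_cpu, w_io, w_net = WEIGHTS.get(job.kind, UNCLASSIFIED)
    return w_cpu * node.cpu_score + w_io * node.io_score + w_net * node.net_score


def place(job: Job, nodes: Sequence[Node]) -> Optional[Node]:
    """Return the best-ranked eligible node, or None if no node qualifies."""
    candidates = [n for n in nodes if eligible(job, n)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: rank(job, n))


# Illustrative inventory; names and scores are made up.
NODES = [
    Node("fast-cpu", cpu_score=9.0, io_score=3.0, net_score=5.0),
    Node("fast-io", cpu_score=4.0, io_score=9.0, net_score=5.0),
    Node("old", cpu_score=2.0, io_score=2.0, net_score=2.0),
]

if __name__ == "__main__":
    for job in (Job("sim", "cpu", min_cpu=5.0), Job("skim", "io")):
        chosen = place(job, NODES)
        print(job.name, "->", chosen.name if chosen else "no eligible node")
```

Keeping the threshold check separate from the ranking means an unclassified job still lands on a node that satisfies its declared minimums, while a classified job is additionally steered toward the node type that suits its behavior.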