Description
The rapid growth of data volumes in high-energy physics (HEP) collaborations, such as the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC), has driven the adoption of regional in-network caches to reduce data access latency. However, the efficiency of these caches varies across locations because of differing access patterns and storage policies. Better resource utilization could significantly improve the performance of scientific computing infrastructure, yet exploring what-if scenarios for capacity planning has remained challenging.
This study investigates cache utilization patterns across three regional caches supporting the CMS experiment, situated in Southern California, Chicago, and Boston. We have developed two complementary prediction methodologies to forecast cache hit rates under hypothetical storage capacities: an LSTM-based model employing transfer learning, and a simpler analytical approach leveraging the footprint of active files for estimating cache hits. The transfer learning methodology utilizes observed modifications in storage capacity at the Southern California site to inform predictions for the Chicago and Boston caches, which have maintained their original capacities. A central contribution of this work is the application of these two distinct prediction techniques to cross-validate the results, thereby enhancing confidence in the what-if scenario analyses.
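To make the idea of a footprint-based estimate concrete, the following is a minimal sketch, not the authors' implementation. It assumes an LRU eviction policy, a fixed size per file, and an access trace given as `(file_id, size_bytes)` pairs; the function names `reuse_footprints` and `estimated_hit_rate` are hypothetical. For each repeat access it computes the bytes of distinct files touched since the previous access to the same file, and counts the access as a hit whenever that footprint fits within a hypothetical capacity.

```python
from collections import OrderedDict


def reuse_footprints(trace):
    """Return one footprint per access in `trace`.

    `trace` is an iterable of (file_id, size_bytes) pairs (an assumed format).
    The footprint of a repeat access is the total size of the distinct files
    touched since the previous access to the same file, including the file
    itself. First-time accesses get None (they are compulsory misses).
    """
    recency = OrderedDict()  # file_id -> size, ordered least to most recent
    footprints = []
    for file_id, size in trace:
        if file_id in recency:
            total = 0
            # Walk from most recent to least recent until we reach this file.
            for fid in reversed(recency):
                total += recency[fid]
                if fid == file_id:
                    break
            footprints.append(total)
            recency.move_to_end(file_id)  # mark as most recently used
        else:
            footprints.append(None)
            recency[file_id] = size
    return footprints


def estimated_hit_rate(footprints, capacity):
    """Fraction of accesses whose reuse footprint fits in `capacity` bytes."""
    if not footprints:
        return 0.0
    hits = sum(1 for f in footprints if f is not None and f <= capacity)
    return hits / len(footprints)


if __name__ == "__main__":
    # Toy trace with made-up file names and sizes, for illustration only.
    trace = [("a", 10), ("b", 20), ("a", 10), ("c", 30), ("b", 20), ("a", 10)]
    fps = reuse_footprints(trace)
    for capacity in (20, 30, 60):
        print(capacity, estimated_hit_rate(fps, capacity))
```

Because the footprints are computed once and then compared against any capacity, a single pass over the trace can answer many what-if capacity questions. The inner scan is linear in the number of tracked files, so a production-scale trace would need a more efficient data structure.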
Our findings indicate that doubling the storage capacity of the Chicago cache could raise its cache hit rate from 50% to 80%, significantly improving resource utilization. The combination of machine learning and analytical techniques presented here offers a validated framework for optimizing cache efficiency, informing resource allocation, and guiding future cache deployments and resource management strategies within large-scale scientific collaborations.