Description
Authors:
- Martin Øines Eide, Western Norway University of Applied Sciences,
University of Bergen, Bergen, Norway and European Organization for
Nuclear Research (CERN), Geneva, Switzerland
- Costin Grigoras, European Organization for Nuclear Research (CERN), Geneva,
Switzerland
on behalf of the ALICE collaboration
The ALICE experiment at CERN relies on a central service known as the Calibration and Conditions Database (CCDB). This service acts as a single, uniform source of the data needed for online and offline reconstruction, analysis, and other crucial tasks within the experiment. The CCDB is fully operational and sustains a heavy workload, serving thousands of requests per second across the online and the distributed offline Grid environments. Because the CCDB service is centralized while ALICE Grid jobs execute in a distributed fashion, connectivity is a significant concern: jobs occasionally encounter connectivity issues when attempting to access the CCDB. Furthermore, redundant lookups, where multiple jobs or even the same job repeatedly request identical pieces of calibration or conditions data, impose an unnecessary load on the central service. To mitigate these operational challenges, the ALICE team is actively investigating and implementing a caching solution.
This work details the specific technical improvements made to the CCDB usage tracking and analysis mechanisms, which were necessary to properly characterize the service's workload and optimize the caching strategy. In particular, to maintain the reliability and responsiveness of the CCDB in the face of immense Grid job traffic, rigorous connection monitoring was implemented. It tracks key network and database metrics, such as the latency for establishing a connection, the duration of open connections, and the frequency of connection timeouts or failures experienced by distributed Grid jobs. By closely monitoring these parameters, the system can identify and flag specific regions or job types prone to connectivity issues, allowing for targeted network or service adjustments. This detailed monitoring directly informs the assessment of database performance and the choice of the caching solution itself, along with its architecture and successful integration into the ALICE Grid middleware, JAliEn.
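
As an illustration only, the kind of per-site flagging described above could be sketched as follows. This is not the CCDB or JAliEn implementation: the record fields, the function name `flag_sites`, the site labels, and both thresholds are assumptions made up for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, Optional


@dataclass
class ConnectionRecord:
    """One observed connection attempt (hypothetical record layout)."""
    site: str                            # assumed label for the Grid site or region
    connect_latency_ms: Optional[float]  # None if the connection was never established
    open_duration_ms: float              # how long the connection stayed open
    failed: bool                         # True on timeout or failure


def flag_sites(
    records: Iterable[ConnectionRecord],
    max_failure_rate: float = 0.05,        # assumed threshold, not from the source
    max_mean_latency_ms: float = 500.0,    # assumed threshold, not from the source
) -> Dict[str, dict]:
    """Return per-site summaries for sites that exceed either threshold."""
    by_site = defaultdict(list)
    for record in records:
        by_site[record.site].append(record)

    flagged = {}
    for site, recs in by_site.items():
        failure_rate = sum(r.failed for r in recs) / len(recs)
        latencies = [
            r.connect_latency_ms for r in recs if r.connect_latency_ms is not None
        ]
        mean_latency = mean(latencies) if latencies else None
        too_slow = mean_latency is not None and mean_latency > max_mean_latency_ms
        if failure_rate > max_failure_rate or too_slow:
            flagged[site] = {
                "failure_rate": failure_rate,
                "mean_latency_ms": mean_latency,
            }
    return flagged


# Example input with invented site names and values.
sample = [
    ConnectionRecord("site_a", 100.0, 2000.0, False),
    ConnectionRecord("site_a", 120.0, 1500.0, False),
    ConnectionRecord("site_b", 150.0, 1800.0, False),
    ConnectionRecord("site_b", None, 0.0, True),
    ConnectionRecord("site_c", 900.0, 2500.0, False),
]
flagged_sites = flag_sites(sample)
```

Grouping by site is just one possible key; the same aggregation could be keyed by job type, which the abstract also names as a dimension of interest.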