Description
Efficient data access is becoming increasingly important for high-energy physics (HEP) workflows on HPC systems. Large datasets, higher concurrency (multi-process and multithreaded execution), and complex event formats can lead to hidden performance issues. The HEP-CCE/SOP group used the Darshan I/O characterization tool to identify data re-operations (repeated reads and writes of the same data) in representative HEP workflows, with ATLAS and CMS production workflows as case studies, quantifying access locality and measuring the impact of repeated I/O on job walltime. Complementary instrumentation of ROOT-based file formats (TTree and RNTuple) enables analysis at the level of individual variables and event ranges, revealing access patterns that inform content slimming and restructuring strategies.
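As an illustrative sketch (not the group's actual tooling), the kind of re-operation metric that per-access Darshan records make possible can be expressed as counting bytes read more than once; the function name and byte-level bookkeeping here are assumptions for illustration:

```python
from collections import defaultdict

def reread_bytes(accesses):
    """Count bytes read more than once, given (offset, length) read records.

    Bytes whose read count exceeds one represent 'wasted' I/O relative to
    a single sequential pass over the file. A byte-granular dict is used
    for clarity; a real tool would merge intervals instead.
    """
    counts = defaultdict(int)
    for offset, length in accesses:
        for b in range(offset, offset + length):
            counts[b] += 1
    return sum(1 for c in counts.values() if c > 1)

# Example: the second and third reads overlap on bytes 100..149
accesses = [(0, 100), (100, 50), (100, 50)]
print(reread_bytes(accesses))  # 50 bytes re-read
```

Dividing such a count by the total bytes read gives a re-read fraction that can be tracked per job and compared across releases.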
To detect regressions over time, we extend ATLAS release-level performance monitoring with continuous re-operation metrics, exposing inefficiencies introduced by software changes and configuration defaults. Initial studies across multiple HPC platforms reveal correlations between access entropy, cluster granularity, and end-to-end runtime.
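A minimal sketch of one such metric, Shannon entropy over file-region access counts (an assumed definition for illustration; the abstract does not fix a formula or a region size):

```python
import math
from collections import Counter

def access_entropy(offsets, region_size=1024 * 1024):
    """Shannon entropy (in bits) of accesses binned into fixed-size file regions.

    Low entropy: accesses concentrated in a few regions (good locality).
    High entropy: accesses spread widely across the file.
    """
    counts = Counter(off // region_size for off in offsets)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# All accesses fall in one region -> entropy 0
print(access_entropy([0, 10, 20]))            # 0.0
# Accesses split evenly across two regions -> 1 bit
print(access_entropy([0, 1024 * 1024]))       # 1.0
```

Tracking this value per release alongside runtime is one way a correlation like the one described above could be surfaced.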
This work provides a scalable methodology for detecting and diagnosing I/O bottlenecks, guiding workflow optimization, and improving resource utilization for HEP experiments as data volumes and HPC concurrency continue to grow in the exascale era and beyond.