25–29 May 2026
Chulalongkorn University
Asia/Bangkok timezone

Intelligent Orchestration of Petabyte-Scale Data Staging for Physics Workflows

25 May 2026, 14:39
18m
Chulalongkorn University

Oral Presentation Track 1 - Data and metadata organization, management and access

Speaker

Alice-Florenta Suiu (National University of Science and Technology POLITEHNICA Bucharest (RO))

Description

The ALICE detector at the CERN LHC generates petabyte-scale raw datasets during heavy-ion collision runs, which must undergo a multi-stage offline reconstruction cycle. EOSALICEO2 serves as the primary high-performance disk buffer for ALICE operations, both during data taking and during data processing, providing the sustained throughput necessary for large-scale parallel reconstruction workflows. These workflows require an aggregate read throughput of approximately 200 GB/s to support ~100,000 parallel CPU cores, a performance level achievable only through the EOSALICEO2 disk buffer. However, the existing tape-based archival systems, distributed across one T0 site (5 PB buffer) and six T1 sites (200–870 TB buffers each), lack the individual buffer and throughput capacity to hold the data needed for a single reconstruction cycle. This distributed architecture, in which data is replicated across sites for redundancy, renders large-scale asynchronous reconstruction infeasible without dynamic staging capabilities that coordinate data recall across multiple custodial systems.
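As a quick consistency check on the figures above (this arithmetic is ours, not part of the abstract), the 200 GB/s aggregate read requirement translates into a modest sustained bandwidth per core:

```python
# Derived per-core bandwidth from the figures quoted above.
aggregate_read_gb_s = 200      # GB/s aggregate read throughput required
parallel_cores = 100_000       # ~100,000 parallel CPU cores
per_core_mb_s = aggregate_read_gb_s * 1000 / parallel_cores
print(per_core_mb_s)           # 2.0 MB/s sustained per core
```

The challenge is thus not per-core bandwidth but the aggregate: only a dedicated disk buffer such as EOSALICEO2 can sustain 200 GB/s across the whole farm.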

This contribution presents a centralized data staging service designed to address this critical bottleneck by automating the recall and transfer of raw data files from custodial tape storage to EOSALICEO2. The system features a web-based operator interface for submitting staging requests, intelligent batch sizing adapted to each tape system's buffer capacity, and comprehensive state management with multi-level retry mechanisms. Performance analysis shows that the combined tape infrastructure provides an aggregate throughput of 6.54 GB/s. ALICE's largest datasets originate from Pb–Pb collision runs, with data-taking periods producing approximately 70 PB of raw data. With individual tape buffers requiring 11–20 days to fill and the staging of a complete period spanning 5–6 months, the system implements a 1-month retry window for tape recalls to handle buffer contention, and allows up to 10 attempts per file transfer to ensure reliable operation across this extended timeline. By enabling reliable petabyte-scale staging, the system makes large-scale asynchronous reconstruction workflows operationally feasible for the first time. Integration with existing data management tools supports a continuous workflow cycle in which reconstructed data can be automatically removed from EOSALICEO2 to free buffer space for staging additional data. The system, however, is designed generically and can target alternative storage backends, provided those endpoints meet similar processing requirements.
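For illustration only, the multi-level retry behaviour described above can be sketched as follows. All names and the structure are hypothetical and do not reflect the service's actual implementation; in particular, the real recall loop is bounded by a 1-month wall-clock window rather than an attempt count, which an attempt cap stands in for here to keep the sketch deterministic.

```python
def stage_file(recall, transfer, max_recall_attempts, max_transfer_attempts=10):
    """Return True once a file is recalled from tape and transferred to disk.

    `recall` and `transfer` are callables returning True on success.
    The tape recall is retried up to `max_recall_attempts` times (standing in
    for the 1-month retry window); the transfer to the disk buffer is retried
    up to `max_transfer_attempts` times (10 in the abstract).
    """
    for _ in range(max_recall_attempts):
        if recall():               # ask the custodial tape system for the file
            break
    else:
        return False               # recall window exhausted (buffer contention)
    for _ in range(max_transfer_attempts):
        if transfer():             # copy the recalled file to EOSALICEO2
            return True
    return False                   # transfer attempts exhausted

def flaky(fail_times):
    """Helper: a callable that fails `fail_times` times, then succeeds."""
    state = {"n": 0}
    def call():
        state["n"] += 1
        return state["n"] > fail_times
    return call

print(stage_file(flaky(2), flaky(9), max_recall_attempts=5))   # True
print(stage_file(flaky(0), flaky(10), max_recall_attempts=5))  # False
```

The two-level loop mirrors the separation the abstract describes: contention for the small tape buffers is absorbed by a long recall window, while transient transfer failures are absorbed by a bounded per-file retry count.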

Authors

Alice-Florenta Suiu (National University of Science and Technology POLITEHNICA Bucharest (RO)), Costin Grigoras (CERN), Latchezar Betev (CERN), Nicolae Tapus (Universitatea Nationala de Stiinta si Tehnologie Politehnica Bucuresti (RO))

Presentation materials

There are no materials yet.