Speaker
Description
Open sharing and reuse of scientific data are increasingly important, both to meet the demands of transparency and reproducibility and to maximize the scientific return of experiments large and small. The FAIR principles (Findable, Accessible, Interoperable, Reusable) call for efficient data publication, discovery, and long-term preservation, which in practice often means costly duplication of data across storage and publishing platforms. In this contribution, we discuss how Rucio, a widely adopted distributed data management system, is being extended with native support for Open Data to address these challenges.
We present the architecture and workflow of the new “Rucio Open Data” capabilities: once data are tagged as “open,” Rucio automatically applies the data placement, replication, metadata tagging, and exposure mechanisms required for public release, with no data duplication or separate export procedure. This mechanism streamlines the transition from internal data workflows to public data sharing.
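The tag-driven workflow described above can be illustrated with a minimal toy model. This is only a sketch under stated assumptions and does not use the Rucio API: the `Dataset` class, the `tag_as_open` function, the `PUBLIC_RSE` endpoint name, the default license, and the `opendata.example.org` URL are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical names throughout: a toy model of the idea, not the Rucio API.
PUBLIC_ENDPOINT = "PUBLIC_RSE"  # assumed name of a publicly readable storage endpoint


@dataclass
class Dataset:
    name: str
    replicas: set = field(default_factory=set)       # endpoints that hold the data
    metadata: dict = field(default_factory=dict)     # key/value metadata
    public_urls: list = field(default_factory=list)  # links exposed to the public


def tag_as_open(dataset, license_id="CC0-1.0"):
    """Mark a dataset as open and derive the release actions from that tag."""
    # Metadata tagging: record the open status and an (assumed) license.
    dataset.metadata["open"] = True
    dataset.metadata["license"] = license_id
    # Placement: ensure a replica on the public endpoint (no-op if already there).
    dataset.replicas.add(PUBLIC_ENDPOINT)
    # Exposure: the public link points at the managed replica, not at an export.
    dataset.public_urls = [
        f"https://opendata.example.org/{PUBLIC_ENDPOINT}/{dataset.name}"
    ]
    return dataset


ds = Dataset(name="run2_sample", replicas={"SITE_A_DISK"})
tag_as_open(ds)
```

In this sketch the single tagging call is the only user action; placement, metadata, and exposure follow from it, which is the property the abstract attributes to the new capabilities.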
We highlight how this approach simplifies compliance with FAIR principles and reduces operational overhead for experiments. We also discuss the integration with the existing CERN Open Data Portal (and similar open-data platforms), enabling experiments to publish data directly from their Rucio-managed storage to public repositories, with metadata and access policies managed consistently.
Finally, we showcase the benefits for both new and legacy datasets: new experiments can plan for open data sharing from day one, and established collaborations can incrementally expose historical data without major re-engineering.