Conveners
Parallel (Track 1): Data and Metadata Organization, Management and Access
- Ruslan Mashinistov (Brookhaven National Laboratory (US))
- Lucia Morganti
Parallel (Track 1): Data and Metadata Organization, Management and Access
- Tigran Mkrtchyan (DESY)
- Lucia Morganti
Parallel (Track 1): Data and Metadata Organization, Management and Access
- Samuel Cadellin Skipsey
- Ruslan Mashinistov (Brookhaven National Laboratory (US))
Parallel (Track 1): Data and Metadata Organization, Management and Access
- Tigran Mkrtchyan (DESY)
- Lucia Morganti
Parallel (Track 1): Data and Metadata Organization, Management and Access
- Samuel Cadellin Skipsey
- Lucia Morganti
Parallel (Track 1): Data and Metadata Organization, Management and Access
- Samuel Cadellin Skipsey
- Tigran Mkrtchyan (DESY)
Parallel (Track 1): Data and Metadata Organization, Management and Access
- Ruslan Mashinistov (Brookhaven National Laboratory (US))
- Tigran Mkrtchyan (DESY)
Description
Data and Metadata Organization, Management and Access
-
Alessandra Forti (University of Manchester (GB)) | 21/10/2024, 16:15 | Track 1 - Data and Metadata Organization, Management and Access | Talk
ATLAS is participating in the WLCG Data Challenges, a bi-yearly program established in 2021 to prepare for the data rates of the High-Luminosity LHC (HL-LHC). In each challenge, transfer rates are increased to ensure preparedness for the full rates by 2029. The goal of the 2024 Data Challenge (DC24) was to reach 25% of the HL-LHC expected transfer rates, with each experiment deciding how to execute...
-
Christoph Wissing (Deutsches Elektronen-Synchrotron (DE)) | 21/10/2024, 16:33 | Track 1 - Data and Metadata Organization, Management and Access | Talk
To verify the readiness of the data distribution infrastructure for the HL-LHC, which is planned to start in 2029, WLCG is organizing a series of data challenges with increasing throughput and complexity. This presentation addresses the contribution of CMS to Data Challenge 2024, which aims to reach 25% of the expected network throughput of the HL-LHC. During the challenge CMS tested various...
-
Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN) | 21/10/2024, 16:51 | Track 1 - Data and Metadata Organization, Management and Access | Talk
ALICE introduced ground-breaking advances in data processing and storage requirements and presented the CERN IT data centre with new challenges with the highest data recording requirement of all experiments. For these reasons, the EOS O2 storage system was designed to be cost-efficient, highly redundant and maximise data resilience to keep data accessible even in the event of unexpected...
-
Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Marian Babik (CERN), Tristan Sullivan (University of Victoria) | 21/10/2024, 17:09 | Track 1 - Data and Metadata Organization, Management and Access | Talk
High-Energy Physics (HEP) experiments rely on complex, global networks to interconnect collaborating sites, data centers, and scientific instruments. Managing these networks for data-intensive scientific projects presents significant challenges because of the ever-increasing volume of data transferred, diverse project requirements with varying quality of service needs, multi-domain...
-
James William Walder (Science and Technology Facilities Council STFC (GB)) | 21/10/2024, 17:27 | Track 1 - Data and Metadata Organization, Management and Access | Talk
To address the needs of forthcoming projects such as the Square Kilometre Array (SKA) and the HL-LHC, there is a critical demand for data transfer nodes (DTNs) to realise O(100) Gb/s of data movement. This high throughput can be attained through combinations of increased concurrency of transfers and improvements in the speed of individual transfers. At the Rutherford Appleton Laboratory...
-
Thomas Byrne, Thomas Jyothish (STFC) | 21/10/2024, 17:45 | Track 1 - Data and Metadata Organization, Management and Access | Talk
To address the need for high transfer throughput for projects such as the LHC experiments, including the upcoming HL-LHC, it is important to make optimal and sustainable use of our available capacity. Load balancing algorithms play a crucial role in distributing incoming network traffic across multiple servers, ensuring optimal resource utilization, preventing server overload, and enhancing...
-
Dr Jaroslav Guenther (CERN) | 22/10/2024, 13:30 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The CERN Tape Archive (CTA) scheduling system implements the workflow and lifecycle of Archive, Retrieve and Repack requests. The transient metadata for queued requests is stored in the Scheduler backend store (Scheduler DB). In our previous work, we presented the CTA Scheduler together with an objectstore-based implementation of the Scheduler DB. Now with four years of experience in...
-
Joao Afonso (CERN) | 22/10/2024, 13:48 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The latest tape hardware technologies (LTO-9, IBM TS1170) impose new constraints on the management of data archived to tape. In the past, new drives could read the previous one or even two generations of media, but this is no longer the case. This means that repacking older media to new media must be carried out on a more aggressive schedule than in the past. An additional challenge is the...
-
Mr Dorin-Daniel Lobontu | 22/10/2024, 14:06 | Track 1 - Data and Metadata Organization, Management and Access | Talk
Storing the ever-increasing amount of data generated by LHC experiments is still inconceivable without making use of the cost-effective, though inherently complex, tape technology. The GridKa tape storage system used to rely on IBM Spectrum Protect (SP). Due to a variety of limitations and to meet the even higher requirements of the HL-LHC project, GridKa decided to switch from SP to High Performance...
-
Xin Zhao (Brookhaven National Laboratory (US)) | 22/10/2024, 14:24 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The High Luminosity upgrade to the LHC (HL-LHC) is expected to generate scientific data on the scale of multiple exabytes. To tackle this unprecedented data storage challenge, the ATLAS experiment initiated the Data Carousel project in 2018. Data Carousel is a tape-driven workflow in which bulk production campaigns with input data resident on tape are executed by staging and promptly...
-
Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)) | 22/10/2024, 14:42 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The Vera Rubin Observatory is a very ambitious project. Using the world’s largest ground-based telescope, it will take two panoramic sweeps of the visible sky every three nights using a 3.2 Giga-pixel camera. The observation products will generate 15 PB of new data each year for 10 years. Accounting for reprocessing and related data products the total amount of critical data will reach several...
-
Julien Leduc (CERN) | 22/10/2024, 15:00 | Track 1 - Data and Metadata Organization, Management and Access | Talk
Due to the increasing volume of physics data being produced, the LHC experiments are making more active use of archival storage. Constraints on available disk storage have motivated the evolution towards the "data carousel" and similar models. Datasets on tape are recalled multiple times for reprocessing and analysis, and this trend is expected to accelerate during the Hi-Lumi era (LHC Run-4...
-
Dmitry Litvintsev (Fermi National Accelerator Lab. (US)), Mr Tigran Mkrtchyan (DESY) | 22/10/2024, 16:15 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The dCache project provides open-source software deployed internationally to satisfy ever-more demanding storage requirements. Its multifaceted approach provides an integrated way of supporting different use-cases with the same storage, from high throughput data ingest, data sharing over wide area networks, efficient access from HPC clusters, and long term data persistence on tertiary...
-
Luca Bassi | 22/10/2024, 16:33 | Track 1 - Data and Metadata Organization, Management and Access | Talk
After the deprecation of the open-source Globus Toolkit used for GridFTP transfers, the WLCG community has shifted its focus to the HTTP protocol. The WebDAV protocol extends HTTP to create, move, copy and delete resources on web servers. StoRM WebDAV provides data storage access and management through the WebDAV protocol over a POSIX file system. Mainly designed to be used by the WLCG...
-
Hugo Gonzalez Labrador (CERN) | 22/10/2024, 16:51 | Track 1 - Data and Metadata Organization, Management and Access | Talk
Managing the data deluge generated by large-scale scientific collaborations is a challenge. The Rucio Data Management platform is an open-source framework engineered to orchestrate the storage, distribution, and management of massive data volumes across a globally distributed computing infrastructure. Rucio meets the requirements of high-energy physics, astrophysics, genomics, and beyond,...
-
Aashay Arora (Univ. of California San Diego (US)) | 22/10/2024, 17:09 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The data movement manager (DMM) is a prototype interface between the CERN-developed data management software Rucio and the software-defined networking (SDN) service SENSE by ESnet. It allows for SDN-enabled high energy physics data flows using the existing Worldwide LHC Computing Grid infrastructure. In addition to the key feature of DMM, namely transfer-priority based bandwidth allocation for...
-
Katy Ellis (Science and Technology Facilities Council STFC (GB)) | 22/10/2024, 17:27 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The Large Hadron Collider (LHC) experiments rely heavily on the XRootD software suite for data transfer and streaming across the Worldwide LHC Computing Grid (WLCG) both within sites (LAN) and across sites (WAN). While XRootD offers extensive monitoring data, there's no single, unified monitoring tool for all experiments. This becomes increasingly critical as network usage grows, and with the...
-
Mihai Patrascoiu (CERN) | 22/10/2024, 17:45 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The WLCG community, with the main LHC experiments at the forefront, is moving away from x509 certificates, replacing the Authentication and Authorization layer with OAuth2 tokens. FTS, as a middleware and core component of the WLCG, plays a crucial role in the transition from x509 proxy certificates to tokens. The paper will present in detail the FTS token design and how this will serve the...
-
Hasan Ozturk (CERN) | 23/10/2024, 13:30 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The CMS experiment manages a large-scale data infrastructure, currently handling over 200 PB of disk and 500 PB of tape storage and transferring more than 1 PB of data per day on average between various WLCG sites. Utilizing Rucio for high-level data management, FTS for data transfers, and a variety of storage and network technologies at the sites, CMS confronts inevitable challenges due to...
-
Wenlong Yuan (The University of Edinburgh (GB)) | 23/10/2024, 13:48 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The Deep Underground Neutrino Experiment (DUNE) is scheduled to start running in 2029 and is expected to record 30 PB/year of raw data. To handle this large-scale data, DUNE has adopted and deployed Rucio, the next-generation Data Replica service originally designed by the ATLAS collaboration, as an essential component of its Distributed Data Management system.
DUNE's use of Rucio has demanded...
-
Rose Cooper | 23/10/2024, 14:06 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The File Transfer Service (FTS) is a bulk data mover responsible for queuing, scheduling, dispatching and retrying file transfer requests, making it a critical infrastructure component for many experiments. FTS is primarily used by the LHC experiments, namely ATLAS, CMS and LHCb, but is also used by some non-LHC experiments, including both AMS and DUNE. FTS is an essential part in the data...
-
Lia Lavezzi (INFN Torino (IT)) | 23/10/2024, 14:24 | Track 1 - Data and Metadata Organization, Management and Access | Talk
Modern physics experiments are often led by large collaborations including scientists and institutions from different parts of the world. To cope with the ever increasing computing and storage demands, computing resources are nowadays offered as part of a distributed infrastructure. Einstein Telescope (ET) is a future third-generation interferometer for gravitational wave (GW) detection, and...
-
Fabio Hernandez (IN2P3 / CNRS computing centre) | 23/10/2024, 14:42 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The set of sky images recorded nightly by the camera mounted on the telescope of the Vera C. Rubin Observatory will be processed in facilities located on three continents. Data acquisition will happen in Cerro Pachón in the Andes mountains in Chile where the observatory is located. A first copy of the raw image data set is stored at the summit site of the observatory and immediately...
-
Tristan Bloomfield (KEK IPNS) | 23/10/2024, 15:00 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The Belle II raw data transfer system is responsible for transferring raw data from the Belle II detector to the local KEK computing centre, and from there to the GRID. The Belle II experiment recently completed its first Long Shutdown period - during this time many upgrades were made to the detector and tools used to handle and analyse the data. The Belle II data acquisition (DAQ) systems...
-
Marcin Nowak (Brookhaven National Laboratory (US)) | 23/10/2024, 16:15 | Track 1 - Data and Metadata Organization, Management and Access | Talk
Since the start of LHC in 2008, the ATLAS experiment has relied on ROOT to provide storage technology for all its processed event data. Internally, ROOT files are organized around TTree structures that are capable of storing complex C++ objects. The capabilities of TTrees have developed over the years and now offer support for advanced concepts like polymorphism, schema evolution and user...
-
Nick Smith (Fermi National Accelerator Lab. (US)) | 23/10/2024, 16:33 | Track 1 - Data and Metadata Organization, Management and Access | Talk
ROOT is planning to move from TTree to RNTuple as the data storage format for HL-LHC in order to, for example, speed up the IO, make the files smaller, and have a modern C++ API. Initially, RNTuple was not planned to support the same set of C++ data structures as TTree supports. CMS has explored the necessary transformations in its standard persistent data types to switch to RNTuple. Many...
-
Dr Byrav Ramamurthy (University of Nebraska-Lincoln) | 23/10/2024, 16:51 | Track 1 - Data and Metadata Organization, Management and Access | Talk
Although caching-based efforts [1] have been in place in the LHC infrastructure in the US, we show that integrating intelligent prefetching and targeted dataset placement into the underlying caching strategy can improve job efficiency further. Newer experiments and experiment upgrades such as HL-LHC and DUNE are expected to produce 10x the amount of data currently being produced. This...
-
Maciej Pawel Szymanski (Argonne National Laboratory (US)) | 23/10/2024, 17:09 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The High-Luminosity upgrade of the Large Hadron Collider (HL-LHC) will increase luminosity and the number of events by an order of magnitude, demanding more concurrent processing. Event processing is trivially parallel, but metadata handling is more complex and breaks that parallelism. However, correct and reliable in-file metadata is crucial for all workflows of the experiment, enabling tasks...
-
Mr Fabian Lambert (LPSC Grenoble IN2P3/CNRS (FR)) | 23/10/2024, 17:27 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The ATLAS Metadata Interface (AMI) is a comprehensive ecosystem designed for metadata aggregation, transformation, and cataloging. With over 20 years of feedback in the LHC context, it is particularly well-suited for scientific experiments that generate large volumes of data.
This presentation explains, in a general manner, why managing metadata is essential regardless of the experiment's...
-
Lorenzo Rinaldi (Universita e INFN, Bologna (IT)), Luciano Gaido | 23/10/2024, 17:45 | Track 1 - Data and Metadata Organization, Management and Access | Talk
Large international collaborations in the field of Nuclear and Subnuclear Physics have been leading the implementation of FAIR principles for managing research data. These principles are essential when dealing with large volumes of data over extended periods and involving scientists from multiple countries. Recently, smaller communities and individual experiments have also started adopting...
-
John Wu (LAWRENCE BERKELEY NATIONAL LABORATORY) | 24/10/2024, 13:30 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The surge in data volumes from large scientific collaborations, like the Large Hadron Collider (LHC), poses challenges and opportunities for High Energy Physics (HEP). With annual data projected to grow thirty-fold by 2028, efficient data management is paramount. The HEP community heavily relies on wide-area networks for global data distribution, often resulting in redundant long-distance...
-
Jose Flix Molina (CIEMAT - Centro de Investigaciones Energéticas Medioambientales y Tec. (ES)) | 24/10/2024, 13:48 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The Large Hadron Collider (LHC) at CERN in Geneva is preparing for a major upgrade that will improve both its accelerator and particle detectors. This strategic move comes in anticipation of a tenfold increase in proton-proton collisions, expected to kick off by 2029 in the upcoming high-luminosity phase. The backbone of this evolution is the World-Wide LHC Computing Grid, crucial for handling...
-
Carlos Fernando Gamboa (Brookhaven National Laboratory (US)) | 24/10/2024, 14:06 | Track 1 - Data and Metadata Organization, Management and Access | Talk
Scientific experiments and computations, especially in High Energy Physics, are generating and accumulating data at an unprecedented rate. Effectively managing this vast volume of data while ensuring efficient data analysis poses a significant challenge for data centers, which must integrate various storage technologies. This paper proposes addressing this challenge by designing a multi-tiered...
-
Michelle Ann Solis (University of Arizona (US)) | 24/10/2024, 14:24 | Track 1 - Data and Metadata Organization, Management and Access | Talk
This paper presents a novel approach to enhance the analysis of ATLAS Detector Control System (DCS) data at CERN. Traditional storage in Oracle databases, optimized for WinCC archiver operations, is challenged by the need for extensive analysis across long timeframes and multiple devices, alongside correlating conditions data. We introduce techniques to improve troubleshooting and analysis of...
-
Tatiana Ovsiannikova (University of Washington (US)) | 24/10/2024, 14:42 | Track 1 - Data and Metadata Organization, Management and Access | Talk
Over the past years, the ROOT team has been developing a new I/O format called RNTuple to store data from experiments at CERN's Large Hadron Collider. RNTuple is designed to improve ROOT's existing TTree I/O subsystem by improving I/O speed and introducing a more efficient binary data format. It can be stored in both ROOT files and object stores, and it's optimized for modern storage hardware...
-
Andrzej Nowicki (CERN) | 24/10/2024, 16:15 | Track 1 - Data and Metadata Organization, Management and Access | Talk
In this presentation, I will outline the upcoming transformations set to take place within CERN's database infrastructure. Among the challenges facing our database team during the Long Shutdown 3 (LS3) will be the upgrade of Oracle databases.
The forthcoming version of Oracle database is introducing a significant internal change as the databases will be converted to a container...
-
Guilherme Amadio (CERN) | 24/10/2024, 16:33 | Track 1 - Data and Metadata Organization, Management and Access | Talk
Remote file access is critical in High Energy Physics (HEP) and is currently facilitated by XRootD and HTTP(S) protocols. With a tenfold increase in data volume expected for Run-4, higher throughput is critical. We compare some client-server implementations on 100GE LANs connected to high-throughput storage devices. A joint project between IT and EP departments aims to evaluate RNTuple as a...
-
Zachary Goggin | 24/10/2024, 16:51 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The recent commissioning of CERN’s Prevessin Data Centre (PDC) brings the opportunity for multi-datacentre Ceph deployments, bringing advantages for business continuity and disaster recovery. However, the simple extension of a single cluster across data centres is impractical due to the impact of latency on Ceph’s strong consistency requirements. This paper reports on our research towards...
-
Matt Doidge (Lancaster University (GB)) | 24/10/2024, 17:09 | Track 1 - Data and Metadata Organization, Management and Access | Talk
Erasure-coded storage systems based on Ceph have become a mainstay within UK Grid sites as a means of providing bulk data storage whilst maintaining a good balance between data safety and space efficiency. A favoured deployment, as used at the Lancaster Tier-2 WLCG site, is to use CephFS mounted on frontend XRootD gateways as a means of presenting this storage to grid users.
These storage...
-
Dr Michał Orzechowski (AGH University of Krakow, Faculty of Computer Science, Poland) | 24/10/2024, 17:27 | Track 1 - Data and Metadata Organization, Management and Access | Talk
The Onedata [1] platform is a high-performance data management system with a distributed, global infrastructure that enables users to access heterogeneous storage resources worldwide. It supports various use cases ranging from personal data management to data-intensive scientific computations. Onedata has a fully distributed architecture that facilitates the creation of a hybrid cloud...
-
Samuel Cadellin Skipsey | 24/10/2024, 17:45 | Track 1 - Data and Metadata Organization, Management and Access | Talk
In order to achieve the higher performance year on year required by the 2030s for future LHC upgrades at a sustainable carbon cost to the environment, it is essential to start with accurate measurements of the state of play. Whilst there have been a number of studies of the carbon cost of compute for WLCG workloads published, rather less has been said on the topic of storage, both nearline...