Conveners
Parallel (Track 7): Computing Infrastructure
- Henryk Giemza (National Centre for Nuclear Research (PL))
- Flavio Pisani (CERN)
- Christoph Wissing (Deutsches Elektronen-Synchrotron (DE))
- Bruno Heinrich Hoeft (KIT - Karlsruhe Institute of Technology (DE))
Description
Computing Infrastructure
A robust computing infrastructure is essential for the success of scientific collaborations. However, smaller or newly founded collaborations often lack the resources to establish and maintain such an infrastructure, resulting in a fragmented analysis environment with varying solutions for different members. This fragmentation can lead to inefficiencies, hinder reproducibility, and create...
BaBar stopped data taking in 2008, but its data is still analyzed by the collaboration. In 2021 a new computing system outside of the SLAC National Accelerator Laboratory was developed, and major changes were needed to preserve the collaboration's ability to analyze the data while all the user-facing front ends stayed the same. The new computing system was put into production in 2022 and...
Although wireless IoT devices are omnipresent in our homes and workplaces, their use in particle accelerators is still uncommon. While the advantages of movable sensors communicating over wireless networks are obvious, the harsh radiation environment of a particle accelerator has been an obstacle to the use of such sensitive devices. Recently, though, CERN has developed a radiation-hard...
Modern data centres provide the efficient Information Technology (IT) infrastructure needed to deliver resources, services, monitoring systems and collected data in a timely fashion. At the same time, data centres have been continuously evolving, foreseeing large increases in resources and adapting to cover multifaceted niches. The CNAF group at INFN (National Institute for Nuclear...
DESY operates multiple dCache storage instances for multiple communities. As each community has different workflows and workloads, their dCache installations range from very large instances with more than 100 PB of data, to instances with up to billions of files or instances with significant LAN and WAN I/O.
To successfully operate all instances and quickly identify issues and performance...
Queen Mary University of London (QMUL) has recently finished refurbishing the data centre that houses its computing cluster supporting the WLCG project. After 20 years of operation, the original data centre had significant cooling issues, and rising energy prices, together with growing awareness of climate change, drove the need for refurbishment. In addition, there is a need to increase the...
The ePIC collaboration is working towards the realization of the first detector at the upcoming Electron-Ion Collider. As part of our computing strategy, we have settled on containers for the distribution of our modular software stacks using spack as the package manager. Based on abstract definitions of multiple mutually consistent software environments, we build dedicated containers on each...
The economies of scale realised by institutional and commercial cloud providers make such resources increasingly attractive for grid computing. We describe an implementation of this approach which has been deployed for Australia's ATLAS and Belle II grid sites. The sites are built entirely with Virtual Machines (VMs) orchestrated by an OpenStack [1] instance. The Storage Element (SE)...
A large fraction of computing workloads in high-energy and nuclear physics is executed using software containers. For physics analysis use, such container images often have sizes of several gigabytes. Executing a large number of such jobs efficiently in parallel on different compute nodes demands the availability and use of caching mechanisms and image-loading techniques to prevent network...
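To make the caching idea concrete, here is a toy sketch (not any specific tool mentioned above) of a digest-keyed local cache: a layer is fetched over the network only on the first request and reused by all subsequent jobs on the node. The cache path and URL handling are illustrative assumptions.

```python
# Toy illustration of the caching idea: fetch a container layer only if its
# digest is not already present in a shared local cache.
import hashlib
import pathlib
import urllib.request

CACHE = pathlib.Path("/scratch/image-cache")    # hypothetical shared cache path

def fetch_layer(url: str, digest: str) -> pathlib.Path:
    """Return a cached copy of the layer, downloading it only on a cache miss."""
    CACHE.mkdir(parents=True, exist_ok=True)
    target = CACHE / digest
    if not target.exists():                      # cache miss: hit the network once
        data = urllib.request.urlopen(url).read()
        assert hashlib.sha256(data).hexdigest() == digest.removeprefix("sha256:")
        target.write_bytes(data)
    return target                                # later jobs reuse this local copy
```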
In recent years, the CMS experiment has expanded the usage of HPC systems for data processing and simulation activities. These resources significantly extend the conventional pledged Grid compute capacity. Within the EuroHPC program, CMS applied for a "Benchmark Access" grant at VEGA in Slovenia, an HPC centre that is being used very successfully by the ATLAS experiment. For CMS, VEGA was...
The Italian National Institute for Nuclear Physics (INFN) has recently developed a national cloud platform to enhance access to distributed computing and storage resources for scientific researchers. A critical aspect of this initiative is the INFN Cloud Dashboard, a user-friendly web portal that allows users to request high-level services on demand, such as Jupyter Hub, Kubernetes, and Spark...
Norwegian contributions to the WLCG consist of computing and storage resources in Bergen and Oslo for the ALICE and ATLAS experiments. The increasing scale and complexity of Grid site infrastructure and operation require integration of national WLCG resources into bigger shared installations. Traditional HPC resources often come with restrictions with respect to software, administration, and...
The German university-based Tier-2 centres successfully contributed a significant fraction of the computing power required for Runs 1-3 of the LHC. But for the upcoming Run 4, with its increased need for both storage and computing power for the various HEP computing tasks, a transition to a new model becomes a necessity. In this context, the German community under the FIDIUM project is making...
In a geo-distributed computing infrastructure with heterogeneous resources (HPC, HTC and possibly cloud), a key to unlocking efficient and user-friendly access to the resources is being able to offload each specific task to the best-suited location. One of the most critical problems is the logistics of wide-area, multi-stage workflows moving back and forth between multiple resource providers....
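A minimal sketch of the offloading decision described above, assuming a simple score built from data locality and free capacity; the site names, fields and weights are illustrative and not the authors' scheduler.

```python
# Hedged sketch: pick the provider best suited to a task, given per-site
# capabilities and the locality of the task's input datasets.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    kind: str            # "HPC", "HTC" or "cloud"
    free_cores: int
    datasets: set[str]   # datasets already resident at the site

def choose_site(task_inputs: set[str], needs_gpu: bool, sites: list[Site]) -> Site:
    def score(site: Site) -> float:
        locality = len(task_inputs & site.datasets) / max(len(task_inputs), 1)
        gpu_ok = 0 if needs_gpu and site.kind != "HPC" else 1   # toy GPU rule
        return gpu_ok * (10 * locality + site.free_cores / 1000)
    return max(sites, key=score)

sites = [Site("cnaf-htc", "HTC", 5000, {"ds-A"}),
         Site("leonardo-hpc", "HPC", 2000, {"ds-A", "ds-B"})]
print(choose_site({"ds-A", "ds-B"}, needs_gpu=True, sites=sites).name)
```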
The MareNostrum 5 (MN5) is the new 750k-core general-purpose cluster recently deployed at the Barcelona Supercomputing Center (BSC). MN5 presents new opportunities for the execution of CMS data processing and simulation tasks but suffers from the same stringent network connectivity limitations as its predecessor, MN4. The innovative solutions implemented to navigate these constraints and...
The CMS experiment's operational infrastructure hinges significantly on the CMSWEB cluster, which serves as the cornerstone for hosting a multitude of services critical to data taking and analysis. Operating on Kubernetes ("k8s") technology, this cluster powers over two dozen distinct web services, including but not limited to DBS, DAS, CRAB, WMarchive, and WMCore.
In this talk, we...
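For orientation, a minimal sketch (not the CMSWEB operations code) of how such Kubernetes-hosted services can be inspected with the official `kubernetes` Python client; the namespace name is an assumption.

```python
# List deployments and replica counts in a CMSWEB-like namespace.
from kubernetes import client, config

config.load_kube_config()                 # or load_incluster_config() in-cluster
apps = client.AppsV1Api()

for dep in apps.list_namespaced_deployment(namespace="cmsweb").items:
    ready = dep.status.ready_replicas or 0
    print(f"{dep.metadata.name}: {ready}/{dep.spec.replicas} replicas ready")
```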
The efficient utilization of multi-purpose HPC resources for High Energy Physics applications is increasingly important, in particular with regard to the upcoming changes in the German HEP computing infrastructure.
In preparation for the future, we are developing and testing an XRootD-based caching and buffering approach for workflow and efficiency optimizations to exploit the full...
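As a small illustration of reading through an XRootD endpoint (such as a caching proxy) from a job, using the official Python bindings; the host name and file path are placeholders, and this is not the caching setup itself.

```python
# Illustrative only: open and read a file via XRootD; a caching proxy would
# simply be a different endpoint in the URL.
from XRootD import client
from XRootD.client.flags import OpenFlags

with client.File() as f:
    status, _ = f.open("root://xcache.example.org//store/data/file.root",
                       OpenFlags.READ)
    if not status.ok:
        raise RuntimeError(status.message)
    status, data = f.read(offset=0, size=1024)   # first kilobyte of the file
    print(len(data), "bytes read")
```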
According to the estimated data rates, about 800 TB of raw experimental data will be produced per day from 14 beamlines at the first stage of the High-Energy Photon Source (HEPS) in China, and the data volume will be even greater with the completion of over 90 beamlines at the second stage in the future. Therefore, designing a high-performance, scalable network architecture plays a...
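A back-of-the-envelope conversion makes the scale concrete: 800 TB per day corresponds to roughly 74 Gb/s sustained, before any safety margin or burst headroom.

```python
# Back-of-the-envelope: 800 TB/day of raw data expressed as a sustained rate.
tb_per_day = 800
bytes_per_day = tb_per_day * 1e12
gbit_per_s = bytes_per_day * 8 / 86_400 / 1e9
print(f"{gbit_per_s:.0f} Gb/s sustained")   # ~74 Gb/s
```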
In a DAQ system a large fraction of CPU resources is engaged in networking rather than in data processing. The common network stacks that take care of network traffic usually manipulate data through several copies, performing expensive operations. Thus, when the CPU is asked to handle networking, the main drawbacks are throughput reduction and latency increase due to the overhead added to the...
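A toy illustration of the copy-reduction idea discussed above, using only the standard socket API: datagrams are read into a preallocated, reused buffer instead of allocating a fresh bytes object per packet. The port number is arbitrary and this is not the authors' DAQ code.

```python
# Read datagrams in-place with recvfrom_into() and hand out zero-copy slices.
import socket

buf = bytearray(65536)                 # reused buffer, no per-packet allocation
view = memoryview(buf)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 50000))

while True:
    nbytes, addr = sock.recvfrom_into(view)    # payload lands directly in `buf`
    payload = view[:nbytes]                    # zero-copy slice of the payload
    # ... hand `payload` to the event builder ...
    break                                      # single iteration for the sketch
```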
The data reduction stage is a major bottleneck in processing data from the Large Hadron Collider (LHC) at CERN, which generates hundreds of petabytes annually for fundamental particle physics research. Here, scientists must refine petabytes into only gigabytes of relevant information for analysis. This data filtering process is limited by slow network speeds when fetching data from globally...
With the large dataset expected from 2029 onwards from the HL-LHC at CERN, the ATLAS experiment is reaching the limits of the current data processing model in terms of traditional CPU resources based on x86_64 architectures, and an extensive program of software upgrades towards the HL-LHC has been set up. The ARM CPU architecture is becoming a competitive and energy-efficient alternative....
GPUs and accelerators are changing traditional High Energy Physics (HEP) deployments while also being the key to enable efficient machine learning. The challenge remains to improve overall efficiency and sharing opportunities of what are currently expensive and scarce resources.
In this paper we describe the common patterns of GPU usage in HEP, including spiky requirements with low overall...
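One way to quantify the "spiky, low overall utilisation" pattern mentioned above is to sample device utilisation over time with NVML; the sketch below uses the `pynvml` bindings and is an illustrative measurement, not the paper's method.

```python
# Sample GPU utilisation with NVML (pip package nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []
for _ in range(10):                              # 10 samples, one per second
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    samples.append(util.gpu)                     # percent busy since last sample
    time.sleep(1)

print(f"mean {sum(samples) / len(samples):.1f}%  peak {max(samples)}%")
pynvml.nvmlShutdown()
```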
The Glance project provides software solutions for managing high-energy physics collaborations' data and workflows. It was started in 2003 and operates in the ALICE, AMBER, ATLAS, CMS, and LHCb CERN experiments on top of CERN common infrastructure. The project develops Web applications using PHP and Vue.js, running on CentOS virtual machines hosted on the CERN OpenStack private cloud. These...
The ATLAS Collaboration operates a large, distributed computing infrastructure: almost 1M cores of computing and almost 1 EB of data are distributed over about 100 computing sites worldwide. These resources contribute significantly to the total carbon footprint of the experiment, and they are expected to grow by a large factor as a part of the experimental upgrades for the HL-LHC at the end of...
As UKRI moves towards a NetZero Digital Research Infrastructure [1], an understanding of how carbon costs of computing infrastructures can be allocated to individual scientific payloads will be required. The IRIS community [2] forms a multi-site heterogeneous infrastructure, so it is a good testing ground to develop carbon allocation models with wide applicability.
The IRISCAST Project [3,4]...
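A toy allocation model in the spirit described above: a site's metered energy is attributed to payloads by their share of core-hours and converted with the local grid carbon intensity. All numbers below are made up for illustration.

```python
# Attribute site energy to payloads by core-hour share, then convert to CO2e.
site_energy_kwh = 12_000          # metered site energy over the period (assumed)
carbon_intensity = 0.20           # kg CO2e per kWh for the local grid (assumed)

payload_core_hours = {"atlas-sim": 40_000, "lsst-pipeline": 25_000, "other": 15_000}
total = sum(payload_core_hours.values())

for name, core_hours in payload_core_hours.items():
    kg_co2e = site_energy_kwh * (core_hours / total) * carbon_intensity
    print(f"{name}: {kg_co2e:.0f} kg CO2e")
```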
The Glasgow ScotGrid facility is now a truly heterogeneous site, with over 4k ARM cores representing 20% of our compute nodes, which has enabled large-scale testing by the experiments and more detailed investigations of performance in a production environment. We present here a number of updates and new results related to our efforts to optimise power efficiency for High Energy Physics (HEP)...
In pursuit of energy-efficient solutions for computing in High Energy Physics (HEP) we have extended our investigations of non-x86 architectures beyond the ARM platforms that we have previously studied. In this work, we have taken a first look at the RISC-V architecture for HEP workloads, leveraging advancements in both hardware and software maturity.
We introduce the Pioneer Milk-V, a...
At INFN-T1 we recently acquired some ARM nodes: initially they were given to the LHC experiments to test workflows and submission pipelines. After some time, they were offered as standard CPU resources, since the stability of both the nodes and the code had reached production quality.
In this presentation we will describe all the activities that were necessary to enable users to run on ARM and will...
The research and education community relies on a robust network in order to access the vast amounts of data generated by their scientific experiments. The underlying infrastructure connects a few hundreds of sites across the world, which require reliable and efficient transfers of increasingly large datasets. These activities demand proactive methods in network management, where potentially...
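As one example of the kind of proactive building block referred to above, a rolling-baseline check can flag throughput samples that deviate sharply from recent history; the window length and threshold below are arbitrary choices for illustration, not the project's actual detector.

```python
# Flag link-throughput samples that deviate strongly from a rolling baseline.
from collections import deque
from statistics import mean, stdev

def anomalies(samples_gbps, window=20, threshold=4.0):
    history = deque(maxlen=window)
    for t, x in enumerate(samples_gbps):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                yield t, x                     # candidate incident to investigate
        history.append(x)

traffic = [80 + (i % 5) for i in range(60)] + [5] + [80] * 20   # sudden drop
print(list(anomalies(traffic)))
```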
Research has become dependent on processing power and storage, with one crucial aspect being data sharing. The Open Science Data Federation (OSDF) project aims to create a scientific global data distribution network, expanding on the StashCache project to add new data origins and caches, access methods, monitoring, and accounting mechanisms. OSDF does not develop any new software, relying on ...
CERN's state-of-the-art Prévessin Data Centre (PDC) is now operational, complementing CERN's Meyrin Data Centre Tier-0 facility to provide additional and sustainable computing power to meet the needs of High-Luminosity LHC in 2029 (expected to be ten times greater than today). In 2019, it was decided to tender the design and construction of a new, modern, energy-efficient (PUE of ≤ 1.15) Data...
We present our unique approach to hosting the Canadian share of the Belle II raw data and the computing infrastructure needed to process the raw data. We will describe the details of the storage system, which is a disk-only storage solution based on XRootD and ZFS, with TSM used for backups. We will also detail the compute setup, which involves starting specialized Virtual Machines (VMs) to process...
Large-scale scientific collaborations like ATLAS, Belle II, CMS, DUNE, and others involve hundreds of research institutes and thousands of researchers spread across the globe. These experiments generate petabytes of data, with volumes soon expected to reach exabytes. Consequently, there is a growing need for computation, including structured data processing from raw data to consumer-ready...
The PUNCH4NFDI consortium, funded by the German Research Foundation for an initial period of five years, gathers various physics communities - particle, astro-, astroparticle, hadron and nuclear physics - from different institutions embedded in the National Research Data Infrastructure initiative. The overall goal of PUNCH4NFDI is the establishment and support of FAIR data management solutions...
We are moving the INFN-T1 data center to a new location. In this presentation we will describe all the steps taken to complete the task without decreasing the general availability of the site and of all the services provided.
We will also briefly describe the new features of our new data center compared to the current one.
The Square Kilometre Array (SKA) is set to revolutionise radio astronomy and will utilise a distributed network of compute and storage resources, known as SRCNet, to store, process and analyse the data at the exascale. The United Kingdom plays a pivotal role in this initiative, contributing a significant portion of the SRCNet infrastructure. SRCNet v0.1, scheduled for early 2025, will...
In the High-Performance Computing (HPC) field, fast and reliable interconnects remain pivotal in delivering efficient data access and analytics.
In recent years, several interconnect implementations have been proposed, targeting optimization, reprogrammability and other critical aspects. Custom Network Interface Cards (NIC) have emerged as viable alternatives to commercially available...
The Worldwide Large Hadron Collider Computing Grid (WLCG) community’s deployment of dual-stack IPv6/IPv4 on its worldwide storage infrastructure is very successful and has been presented by us at earlier CHEP conferences. Dual-stack is not, however, a viable long-term solution; the HEPiX IPv6 Working Group has focused on studying where and why IPv4 is still being used, and how to flip such...
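A small helper in the spirit of the working group's task: checking whether a storage endpoint is reachable over IPv6 at all, i.e. whether it publishes an AAAA record. The hostname and port below are placeholders.

```python
# Check whether a host resolves to any IPv6 address.
import socket

def has_ipv6(host: str, port: int = 1094) -> bool:
    try:
        infos = socket.getaddrinfo(host, port, family=socket.AF_INET6,
                                   type=socket.SOCK_STREAM)
    except socket.gaierror:
        return False
    return len(infos) > 0

print(has_ipv6("xrootd.example.org"))
```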
The Large Hadron Collider (LHC) experiments rely on a diverse network of National Research and Education Networks (NRENs) to distribute their data efficiently. These networks are treated as "best-effort" resources by the experiment data management systems. Following the High Luminosity upgrade, the Compact Muon Solenoid (CMS) experiment is projected to generate approximately 0.5 exabytes of...
The Network Optimised Experimental Data Transfer (NOTED) project has undergone successful testing at several international conferences, including the International Conference for High Performance Computing, Networking, Storage and Analysis (also known as SuperComputing). It has also been tested at scale during the WLCG Data Challenge 2024, in which NRENs and WLCG sites conducted testing at 25% of the...
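A toy version of the kind of decision such tooling has to make: provision extra network capacity when the transfer backlog between two sites stays above a threshold, and release it once the backlog drains. The threshold and the "provision" action are placeholders, not NOTED internals.

```python
# Threshold-based decision on when to request or release a dynamic circuit.
THRESHOLD_TB = 100        # backlog that justifies extra capacity (assumed)

def decide(queued_tb: float, link_active: bool) -> str:
    if queued_tb > THRESHOLD_TB and not link_active:
        return "provision"        # e.g. ask the NREN for an extra lightpath
    if queued_tb < 0.1 * THRESHOLD_TB and link_active:
        return "release"
    return "hold"

print(decide(queued_tb=250, link_active=False))   # -> provision
```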
This presentation delves into the implementation and optimization of checkpoint-restart mechanisms in High-Performance Computing (HPC) environments, with a particular focus on Distributed MultiThreaded CheckPointing (DMTCP). We explore the use of DMTCP both within and outside of containerized environments, emphasizing its application on NERSC Perlmutter, a cutting-edge supercomputing system....
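For readers unfamiliar with DMTCP, the sketch below shows the basic launch/checkpoint/restart workflow driven from Python via the standard `dmtcp_launch` and `dmtcp_restart` command-line tools; the payload binary, checkpoint interval and file pattern are placeholder assumptions, and this is not the Perlmutter deployment itself.

```python
# Minimal DMTCP workflow: launch under checkpoint control, later restart.
import glob
import subprocess

# Launch the application under DMTCP control, checkpointing every 600 s.
subprocess.run(["dmtcp_launch", "--interval", "600", "./my_hpc_app"], check=True)

# Later (e.g. in the next batch allocation), resume from the newest checkpoint.
ckpts = sorted(glob.glob("ckpt_*.dmtcp"))
if ckpts:
    subprocess.run(["dmtcp_restart", ckpts[-1]], check=True)
```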