Conveners
Parallel (Track 4): Distributed Computing
- Daniela Bauer (Imperial College (GB))
- Fabio Hernandez (IN2P3 / CNRS computing centre)
- Panos Paparrigopoulos (CERN)
- Gianfranco Sciacca (Universitaet Bern (CH))
Description
Distributed Computing
Since 2017, the Worldwide LHC Computing Grid (WLCG) has been working towards enabling token-based authentication and authorization throughout its entire middleware stack.
Guided by the WLCG Token Transition Timeline, published in 2022, the community has made substantial progress, not only in making middleware compatible with the use of tokens, but also in understanding the limitations...
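As an aside on what such tokens actually carry, the following minimal sketch (assuming a standard three-part JWT and Python 3) base64-decodes the payload of an access token without verifying its signature; the claim names mentioned in the comments come from the WLCG Common JWT Profile, while any concrete values are hypothetical.

```python
import base64
import json
import sys

def decode_payload(token: str) -> dict:
    """Return the (unverified) payload of a JWT access token."""
    payload_b64 = token.split(".")[1]             # header.payload.signature
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

if __name__ == "__main__":
    claims = decode_payload(sys.argv[1])
    # Claims defined by the WLCG Common JWT Profile include, for example,
    # "wlcg.ver", "iss", "sub", "aud", "exp", and capability scopes such as
    # "storage.read:/" or "compute.create" carried in the "scope" claim.
    print(json.dumps(claims, indent=2))
```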
The Token Trust and Traceability Working Group (TTT) was formed in 2023 to answer questions of policy and best practice raised by the ongoing move from X.509 and VOMS proxy certificates to token-based solutions as the primary authentication and authorisation method in grid environments. With a remit to act in an investigatory and advisory capacity alongside other working groups in...
Within the LHC community, a momentous transition in authorization has been under way. For nearly 20 years, services within the Worldwide LHC Computing Grid (WLCG) have authorized requests by mapping an identity, derived from an X.509 credential, or a group/role derived from a VOMS extension issued by the experiment. A fundamental shift towards capabilities is occurring: the credential, a bearer...
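A minimal sketch of what capability-based authorization can look like, assuming scopes of the form `<operation>:<path-prefix>` as in the WLCG profile; the helper below is purely illustrative prefix matching, not the full set of rules a real storage or compute service applies.

```python
def scope_authorizes(scopes: list[str], operation: str, path: str) -> bool:
    """Capability-style check: does any granted '<operation>:<prefix>'
    scope cover the requested path? (Illustrative simplification only.)"""
    for scope in scopes:
        if ":" not in scope:
            continue
        op, prefix = scope.split(":", 1)
        if op == operation and (
            path == prefix or path.startswith(prefix.rstrip("/") + "/")
        ):
            return True
    return False

# A token carrying "storage.read:/" may read anywhere, while
# "storage.modify:/atlas/scratch" only allows writes under that prefix.
assert scope_authorizes(["storage.read:/"], "storage.read", "/atlas/data/f.root")
assert not scope_authorizes(["storage.modify:/atlas/scratch"], "storage.modify", "/cms")
```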
Fermilab is the first High Energy Physics institution to transition from X.509 user certificates to authentication tokens in production systems. All of the experiments that Fermilab hosts are now using JSON Web Token (JWT) access tokens in their grid jobs. Many software components have been either updated or created for this transition, and most of the software is available to others as open...
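In practice, a grid job finds its token through environment variables or a well-known file. The sketch below roughly follows the WLCG Bearer Token Discovery convention (BEARER_TOKEN, then BEARER_TOKEN_FILE, then a bt_u<uid> file under XDG_RUNTIME_DIR or /tmp); treat it as an approximation and consult the specification for the authoritative rules.

```python
import os

def discover_bearer_token() -> str | None:
    """Locate an access token following (approximately) the WLCG Bearer
    Token Discovery order of precedence."""
    token = os.environ.get("BEARER_TOKEN")
    if token:
        return token.strip()
    path = os.environ.get("BEARER_TOKEN_FILE")
    if not path:
        runtime_dir = os.environ.get("XDG_RUNTIME_DIR", "/tmp")
        path = os.path.join(runtime_dir, f"bt_u{os.getuid()}")
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None
```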
INDIGO IAM (Identity and Access Management) is a comprehensive service that enables organizations to manage and control access to their resources and systems effectively. It implements a standard OAuth2 Authorization Server and OpenID Connect Provider, and has been chosen as the AAI solution by the WLCG community for the transition from VOMS proxy-based authorization to JSON web...
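For illustration, a service account could obtain an access token from an IAM instance with a standard OAuth2 client-credentials request; the endpoint URL, client identifier and secret below are placeholders, and the scopes are examples in the WLCG style.

```python
import requests

# Placeholder IAM instance and client registration; the request itself is
# a standard OAuth2 client-credentials grant, which INDIGO IAM supports.
TOKEN_ENDPOINT = "https://iam.example.org/token"

resp = requests.post(
    TOKEN_ENDPOINT,
    data={
        "grant_type": "client_credentials",
        "scope": "storage.read:/ compute.read",
    },
    auth=("example-client-id", "example-client-secret"),
    timeout=30,
)
resp.raise_for_status()
access_token = resp.json()["access_token"]
```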
X.509 certificates and VOMS proxies are still widely used by various scientific communities for authentication and authorization (authN/Z) in Grid Storage and Computing Elements. Although this has contributed to improving scientific collaboration worldwide, X.509 authN/Z comes with some interoperability issues with modern Cloud-based tools and services.
The Grid computing communities have...
The CMS computing infrastructure, spread globally over 150 WLCG sites, forms an intricate ecosystem of computing resources, software and services. In 2024, the production computing cores surpassed the half-million mark, and storage capacity stands at 250 petabytes on disk and 1.20 exabytes on tape. To monitor these resources in real time, CMS, working closely with CERN IT, has developed a multifaceted...
JAliEn, the ALICE experiment's Grid middleware, utilizes whole-node scheduling to maximize resource utilization from participating sites. This approach offers flexibility in resource allocation and partitioning, allowing for customized configurations that adapt to the evolving needs of the experiment. This scheduling model is gaining traction among Grid sites due to its initial performance...
Job pilots in the ALICE Grid are increasingly tasked with managing the resources given to each job slot as effectively as possible. With the emergence of more complex, multicore-oriented workflows, this has become an increasingly challenging process, as users often request arbitrary resources, in particular CPU and memory. This is further exacerbated by often having several user payloads...
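To make the allocation problem concrete, the sketch below greedily packs payload resource requests into a whole-node slot using first-fit; the Slot type, numbers and strategy are illustrative assumptions, and the actual JAliEn pilot logic is considerably more involved.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    cores: int
    memory_gb: float

def first_fit(slot: Slot, requests: list[Slot]) -> list[Slot]:
    """Greedily accept payload requests while they fit in the free
    CPU/memory budget of the whole-node slot (illustrative only)."""
    accepted = []
    free = Slot(slot.cores, slot.memory_gb)
    for req in requests:
        if req.cores <= free.cores and req.memory_gb <= free.memory_gb:
            accepted.append(req)
            free.cores -= req.cores
            free.memory_gb -= req.memory_gb
    return accepted

# e.g. filling a 64-core / 256 GB node with a mix of 8-core and 1-core payloads
print(first_fit(Slot(64, 256), [Slot(8, 32), Slot(8, 64), Slot(1, 2)] * 4))
```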
Unified Experiment Monitoring (UEM) is the WLCG project whose objective is to harmonise job accounting reports across the LHC experiments, in order to provide aggregated reports of the compute capacity used by WLCG over time. This accounting overview of all LHC experiments is vital for WLCG strategy planning and therefore has the strong support of the LHC Committee...
The risk of cyber attack against members of the research and education sector remains persistently high, with several recent high visibility incidents including a well-reported ransomware attack against the British Library. As reported previously, we must work collaboratively to defend our community against such attacks, notably through the active use of threat intelligence shared with trusted...
GlideinWMS was one of the first middleware components in the WLCG community to transition from X.509 and add support for tokens. The first step was to go from a prototype in 2019 to using tokens in production in 2022. This paper will present the challenges introduced by the wider adoption of tokens and the evolution plans for securing the pilot infrastructure of GlideinWMS and supporting the new...
The WLCG infrastructure is quickly evolving thanks to technology evolution in all areas of LHC computing: storage, network, alternative processor architectures, new authentication & authorization mechanisms, etc. This evolution also has to address challenges like the seamless integration of HPC and cloud resources, the significant rise of energy costs, licensing issues and support changes....
This paper presents a comprehensive analysis of the implementation and performance enhancements of the new job optimizer service within the JAliEn (Java ALICE environment) middleware framework developed for the ALICE grid. The job optimizer service aims to efficiently split large-scale computational tasks into smaller grid jobs, thereby optimizing resource utilization and throughput of the...
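In spirit, the optimizer turns one large task into many smaller grid subjobs; the fragment below shows the simplest possible form of such splitting, by chunking the input-file list (the function name, file-name pattern and chunk size are illustrative, not the JAliEn implementation).

```python
def split_into_subjobs(input_files: list[str], files_per_job: int) -> list[list[str]]:
    """Split a master job's input collection into fixed-size chunks,
    each chunk becoming one grid subjob (simplified sketch)."""
    return [
        input_files[i:i + files_per_job]
        for i in range(0, len(input_files), files_per_job)
    ]

# e.g. 1000 input files split into subjobs of 50 files each -> 20 subjobs
subjobs = split_into_subjobs([f"/alice/data/run/{n:06d}.root" for n in range(1000)], 50)
print(len(subjobs))
```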
HammerCloud (HC) is a framework for testing and benchmarking resources of the Worldwide LHC Computing Grid (WLCG). It tests the computing resources and the various components of distributed systems with workloads that can range from very simple functional tests to full-chain experiment workflows. This contribution concentrates on the ATLAS implementation, which makes extensive use of HC for...
In April 2023 HEPScore23, the new benchmark based on HEP specific applications, was adopted by WLCG, replacing HEP-SPEC06. As part of the transition to the new benchmark, the CPU core power published by the sites needed to be compared with the effective power observed while running ATLAS workloads. One aim was to verify the conversion rate between the scores of the old and the new benchmark....
In early 2024, ATLAS undertook an architectural review to evaluate the functionalities of its current components within the workflow and workload management ecosystem. Pivotal to the review was the assessment of the Production and Distributed Analysis (PanDA) system, which plays a vital role in the overall infrastructure.
The review findings indicated that while the current system shows no...
Efficient utilization of vast amounts of distributed compute resources is a key element in the success of the scientific programs of the LHC experiments. The CMS Submission Infrastructure is the main computing resource provisioning system for CMS workflows, including data processing, simulation and analysis. Resources geographically distributed across numerous institutions, including Grid, HPC...
The Square Kilometre Array (SKA) is set to be the largest and most sensitive radio telescope in the world. As construction advances, managing and processing data on an exabyte scale becomes a paramount challenge in enabling the SKA science community to process and analyse their data. To address this, the SKA Regional Centre Network (SRCNet) has been established to provide the necessary...
The Cherenkov Telescope Array Observatory (CTAO) is the next-generation instrument in the very-high-energy gamma-ray astronomy domain. It will consist of tens of Cherenkov telescopes deployed at two CTAO array sites, at La Palma (Spain) and Paranal (ESO, Chile). Currently under construction, CTAO will start operations in the coming years for a duration of about 30 years. During...
The Einstein Telescope is the proposed European next-generation ground-based gravitational-wave observatory, planned to have vastly increased sensitivity with respect to current observatories, particularly at lower frequencies. This will result in the detection of far more transient events, which will stay in band for much longer, such that there will nearly always be at least...
The DUNE experiment will start running in 2029 and record 30 PB/year of raw waveforms from Liquid Argon TPCs and photon detectors. The size of individual readouts can range from 100 MB, to a typical 8 GB full readout of the detector, to extended readouts of up to several hundred TB from supernova candidates. These data then need to be cataloged, stored and distributed for processing worldwide....
After several years of focused work, preparation for Data Release Production (DRP) of the Vera C. Rubin Observatory’s Legacy Survey of Space and Time (LSST) at multiple data facilities is taking shape. Rubin Observatory DRP features both complex, long workflows with many short jobs, and fewer long jobs with sometimes unpredictably large memory usage. Both of these create scaling issues that...
The High Energy cosmic-Radiation Detection (HERD) facility is a space astronomy and particle astrophysics experiment under construction as a collaboration between China and Italy; it will run on the China Space Station for more than 10 years starting in 2027. HERD is designed to search for dark matter with unprecedented sensitivity, investigate the century-old mystery of the origin of cosmic rays,...
The LHAASO experiment is a new-generation multi-component experiment designed to study cosmic rays and gamma-ray astronomy. The data volume from LHAASO has currently reached ~40 PB, and ~11 PB of new data will be generated every year in the future. Data at this scale requires large-scale computing resources to process. For the LHAASO experiment, there are several types of computing sites that join the...
The Perlmutter HPC system is the 9th-generation supercomputer deployed at the National Energy Research Scientific Computing Center (NERSC). It provides both CPU and GPU resources, offering 393,216 AMD EPYC Milan cores with 4 GB of memory per core for CPU-oriented jobs, and 7,168 NVIDIA A100 GPUs. The machine allows connections from the worker nodes to the outside and already mounts CVMFS for...
The ALICE Collaboration has begun exploring the use of ARM resources for the execution of Grid payloads. This was prompted by both their recent availability in the WLCG, as well as their increased competitiveness with traditional x86-based hosts in terms of both cost and performance. With the number of OEMs providing ARM offerings aimed towards servers and HPC growing, the presence of these...
The CernVM File System (CVMFS) is an efficient distributed, read-only file system that streams software and data on demand. Its main focus is to distribute experiment software and conditions data to the world-wide LHC computing infrastructure. In WLCG, more than 5 billion files are distributed via CVMFS and its read-only file system client is installed on more than 100,000 worker nodes. Recent...
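Since CVMFS looks to clients like an ordinary read-only POSIX file system, using it is just a matter of accessing paths under /cvmfs, which triggers on-demand download and local caching; the repository name below is an example and assumes a node with a configured CVMFS client (e.g. autofs-mounted).

```python
import os

# Accessing a path under /cvmfs mounts the repository (if autofs is
# configured) and fetches only the directory entries and files touched,
# which are then cached locally by the CVMFS client.
repo = "/cvmfs/atlas.cern.ch"  # example repository name

if os.path.isdir(repo):
    print(sorted(os.listdir(repo))[:10])
else:
    print("CVMFS client not available or repository not configured")
```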
The HEPCloud Facility at Fermilab has now been in operation for six years. This facility provides a unified provisioning gateway to high-performance computing centers, including NERSC, OLCF, and ALCF, other large supercomputers run by the NSF, and commercial clouds. HEPCloud delivers hundreds of millions of core-hours yearly for CMS. HEPCloud also serves other Fermilab experiments...
The amount of data gathered, shared and processed in frontier research is set to increase steeply in the coming decade, leading to unprecedented data processing, simulation and analysis needs.
In particular, the research communities in High Energy Physics and Radio Astronomy are preparing to launch new instruments that require data and compute infrastructures several orders of magnitude...