The CHEP conference series addresses the computing, networking and software issues for the world’s leading data‐intensive science experiments that currently analyse hundreds of petabytes of data using worldwide computing resources.
The CHEP conference rotates between the Americas, Asia and Europe, and is typically held every eighteen months. The CHEP 2024 conference will be hosted by the AGH University of Kraków, the Institute of Nuclear Physics of the Polish Academy of Sciences and the Jagiellonian University.
EGI Foundation supports CHEP with two coordinated projects:
See special offer for Conference Attendees! LINK
See Indico for details.
Note the registration options LINK and the list of recommended accommodation LINK.
Due to the tense political situation and the conflict between Ukraine and the Russian Federation, all research institutions in Poland have suspended scientific cooperation with institutions in Russia until further notice. Regrettably, we are unable to accept registrations from individuals affiliated with Russian institutions.
Meeting point: in front of the venue, the Auditorium Maximum of Jagiellonian University - Krupnicza 33 Street
The IRIS-HEP software institute, as a contributor to the broader HEP Python ecosystem, is developing scalable analysis infrastructure and software tools to address the upcoming HL-LHC computing challenges with new approaches and paradigms, driven by our vision of what HL-LHC analysis will require. The institute uses a “Grand Challenge” format, constructing a series of increasingly large, complex, and realistic exercises to show the vision of HL-LHC analysis. Recently, the focus has been demonstrating the IRIS-HEP analysis infrastructure at scale and evaluating technology readiness for production.
As part of the Analysis Grand Challenge activities, the institute executed a “200 Gbps Challenge”, aiming to show sustained data rates into the event processing of multiple analysis pipelines. The challenge integrated teams internal and external to the institute, spanning operations and facilities, analysis software tools, innovative data delivery and management services, and scalable analysis infrastructure. It showcased the prototypes — including software, services, and facilities — built to process around 200 TB of data in both the CMS NanoAOD and ATLAS PHYSLITE data formats with test pipelines.
The teams were able to sustain the 200 Gbps target across multiple pipelines, and the pipelines focusing on event rate processed at over 30 MHz. These target rates are demanding; the activity revealed considerations for future testing at this scale and the changes physicists will need in order to work at such rates. The 200 Gbps Challenge has established a baseline on today’s facilities, setting the stage for the next exercise at twice the scale.
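As a rough cross-check of these headline figures (a back-of-envelope calculation, not taken from the contribution itself), the target data rate, data volume and event rate can be related as follows:

```python
# Back-of-envelope check of the 200 Gbps challenge figures (illustrative only).
data_rate_gbps = 200                      # sustained target, gigabits per second
data_rate_gb_s = data_rate_gbps / 8       # -> 25 gigabytes per second
volume_tb = 200                           # approximate dataset size in terabytes

seconds_to_read = volume_tb * 1000 / data_rate_gb_s
print(f"Time to stream 200 TB at 200 Gbps: {seconds_to_read / 3600:.1f} h")  # ~2.2 h

event_rate_hz = 30e6                      # 30 MHz event-processing rate
bytes_per_event = data_rate_gb_s * 1e9 / event_rate_hz
print(f"Implied average payload at 30 MHz: {bytes_per_event:.0f} bytes/event")  # ~830 bytes
```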
For the High-Luminosity Large Hadron Collider era, the trigger and data acquisition system of the Compact Muon Solenoid experiment will be entirely replaced. Novel design choices have been explored, including ATCA prototyping platforms with SoC controllers and newly available interconnect technologies with serial optical links with data rates up to 28 Gb/s. Trigger data analysis will be performed through sophisticated algorithms, including widespread use of Machine Learning, in large FPGAs, such as the Xilinx Ultrascale family. The system will process over 50 Tb/s of detector data with an event rate of 750 kHz. The talk will discuss the technological and algorithmic aspects of the upgrade of the CMS trigger system, emphasizing the use of low-latency Machine Learning and AI algorithms with several examples.
Since the beginning of Run 3 of the LHC, the upgraded LHCb experiment has been using a triggerless readout system that collects data at an event rate of 30 MHz and a data rate of 4 TB/s. The trigger system is split into two high-level trigger (HLT) stages. In the first stage (HLT1), implemented on GPGPUs, track reconstruction and vertex fitting for charged particles are performed to reduce the event rate to 1 MHz, at which point the events are buffered to disk. In the second stage (HLT2), deployed on a CPU server farm, a full offline-quality reconstruction and selection of charged and neutral particles is performed, aided by detector alignment and calibration run in quasi-real time on the buffered events. This allows the output of the trigger to be used directly for offline analysis. In this talk we will review the implementation and challenges of the heterogeneous LHCb trigger system, discuss the operational experience and first results of Run 3, and present the prospects for the High-Luminosity LHC era.
Julia is a mature general-purpose programming language, with a large ecosystem of libraries and more than 10000 third-party packages, which specifically targets scientific computing. As a language, Julia is as dynamic, interactive, and accessible as Python with NumPy, but achieves run-time performance on par with C/C++. In this paper, we describe the state of adoption of Julia in HEP, where momentum has been gathering over a number of years.
HEP-oriented Julia packages can, via UnROOT.jl, already read HEP's major file formats, including TTree and RNTuple formats. Interfaces to some of HEP's major software packages, such as through Geant4.jl, are available too. Jet reconstruction algorithms in Julia show excellent performance. A number of full HEP analyses have been performed in Julia.
We show how, as the support for HEP has matured, developments have benefited from Julia's core design choices, which make reuse of, and integration with, other packages easy. In particular, libraries developed outside HEP for plotting, statistics, fitting, and scientific machine learning are extremely useful.
We believe that the powerful combination of flexibility and speed, the wide selection of scientific programming tools, and support for all modern programming paradigms and tools, make Julia the ideal choice for a future language in HEP.
Detailed event simulation at the LHC consumes a large fraction of the computing budget. CMS has developed an end-to-end ML-based simulation that can speed up the production of analysis samples by several orders of magnitude with a limited loss of accuracy. As the CMS experiment adopts a common analysis-level format, NanoAOD, for a growing number of analyses, this event representation is used as the target of the ultra-fast simulation we call FlashSim. Generator-level events, from PYTHIA or other generators, are translated directly into NanoAOD events at rates of several hundred Hz with FlashSim. We show how training FlashSim on a limited number of full-simulation events is sufficient to achieve very good accuracy on larger datasets for processes not seen at training time. Comparisons with full-simulation samples in simplified benchmark analyses are also shown. With this work, we aim to establish a new paradigm for LHC collision-simulation workflows in view of the HL-LHC.
The ATLAS Collaboration has released an extensive volume of data for research use for the first time. The full datasets of proton collisions from 2015 and 2016, alongside a wide array of matching simulated data, are all offered in the PHYSLITE format. This lightweight format is chosen for its efficiency and is the preferred standard for ATLAS internal analyses. Additionally, the inclusion of Heavy Ion collision data considerably widens the scope for research within the particle physics community. To ensure accessibility and usability, the release includes a comprehensive suite of software tools and detailed documentation, catering to a varied audience. Code examples, from basic Jupyter notebooks to more complex C++ analysis packages, aim to facilitate engagement with the data. This contribution details the available data, corresponding metadata, software, and documentation, and initial interactions with researchers outside the ATLAS collaboration, underscoring the project's potential to foster new research and collaborations.
Online and real-time computing
Since 2022, the LHCb detector has been taking data with a full software trigger at the LHC proton-proton collision rate, implemented on GPUs in the first stage and on CPUs in the second stage. This setup allows the alignment and calibration to be performed online and physics analyses to run directly on the output of the online reconstruction, following the real-time analysis paradigm. This talk will give a detailed overview of the LHCb trigger implementation and its underlying computing infrastructure, discuss the challenges of using a heterogeneous architecture, and report its performance in nominal data-taking conditions during 2024, after two commissioning years.
The ATLAS experiment in the LHC Run 3 uses a two-level trigger system to select events of interest to reduce the 40 MHz bunch crossing rate to a recorded rate of up to 3 kHz of fully-built physics events. The trigger system is composed of a hardware based Level-1 trigger and a software based High Level Trigger. The selection of events by the High Level Trigger is based on a wide variety of reconstructed objects, including leptons, photons, jets, b-jets, missing transverse energy, and B-hadrons, in order to cover the full range of the ATLAS physics programme.
We will present an overview of improvements in the reconstruction, calibration, and performance of the different trigger objects, as well as the computational performance of the High Level Trigger system.
Timepix4 is an innovative multi-purpose ASIC developed by the Medipix4 Collaboration at CERN for fundamental and applied physics detection systems. It consists of a ~7 cm$^2$ pixel matrix with about 230k independent pixels, each with a charge-integration circuit, a discriminator and a time-to-digital converter that measure Time-of-Arrival in 195 ps bins and Time-over-Threshold in 1.56 ns bins. Timepix4 can produce up to 160 Gbps of output data, so a strong software counterpart is needed for fast and efficient data processing.
We developed an open-source multi-threaded C++ framework to manage the Timepix4 ASIC, regardless of which control board is used for communication with the server. The software can configure Timepix4 through low- and high-level functions, depending on the final user’s expertise and their need for customization. These methods also allow the user to easily perform complex routines, like pixel-matrix equalization and calibration, with user-friendly C++ scripts.
When the acquisition starts, dedicated read-out threads safely store Timepix4 data on disk. Offline post-acquisition classes can be used to analyze the data, using a custom clustering algorithm that can process more than 1M events/s and, if needed, an ad-hoc convolutional neural network for particle-track identification. If the acquisition rate is lower than 1M events/s, clustering can be performed online, exploiting a dedicated thread, connected to the read-out ones, that runs the same algorithm. Moreover, an online monitor thread can be connected to the clustering object to visualize up to O(100) kEvents/s, showing a hit map and real-time statistics such as cluster size and energy.
In this contribution we will present the software architecture, its performance and some results obtained during acquisitions with radioactive sources, X-ray tubes and monochromatic synchrotron X-ray beams.
The NA62 experiment is designed to study rare kaon decays using a decay-in-flight technique. Its Trigger and Data Acquisition (TDAQ) system is multi-level, making it critically dependent on the performance of the inter-level network.
To manage the enormous amount of data produced by the detectors, three trigger levels are used. The first level, L0TP, implemented on an FPGA device, has been in operation since the start of data taking in 2016.
To increase the efficiency of the system and implement additional algorithms, an upgraded system (L0TP+) was developed starting in 2018. This upgrade utilizes a high-end FPGA available on the market, offering more computing power, larger local memory, and higher transmission bandwidth.
We have planned tests for a new trigger algorithm that implements quadrant-based logic for the veto systems. This new approach is expected to improve the main trigger efficiency by several percent.
Extensive tests were conducted using a parasitic setup that included a set of Network TAPs and a commodity server, allowing for proficient comparison of trigger decisions on an event-by-event basis. The experience gained from this parasitic mode operation can be leveraged for the next data-taking period as a development setup to implement additional features, thereby accelerating the TDAQ upgrade.
Following the testing period, the new system has been in operation as the online trigger processor since 2023. Preliminary results on the efficiency of the new system will be reported. Integration with the new AI-based FPGA-RICH system, which performs online partial particle identification, will also be discussed.
Digital ELI-NP List-mode Acquisition (DELILA) is a data acquisition (DAQ) system for the Variable Energy GAmma (VEGA) beamline at Extreme Light Infrastructure – Nuclear Physics (ELI-NP), Magurele, Romania [1]. ELI-NP has been implementing the VEGA beamline and will operate it fully in 2026. Several different detectors and experiments (e.g. High Purity Ge (HPGe) detectors, Si detectors and scintillator detectors) will be placed at the VEGA beamline and read out by CAEN digitizers, Mesytec ADCs and TDCs, and other electronics [2]. DELILA has been developed mainly using DAQ-Middleware and the CAEN digitizer libraries to fit the experiments and the read-out electronics [3]. The main requirements are network transparency and synchronized time stamps. DAQ-Middleware allows us to fetch data from different electronics and computers into a data merger via Ethernet.
DELILA uses two databases to record experimental information: MongoDB for the run information and InfluxDB for event rates. DELILA uses ROOT libraries for online monitoring and recording experiment data.
The DAQ system has been used for several experiments at the IFIN-HH 9 MV and 3 MV tandem beamlines in Romania [4]. We will present the implementation and results of DELILA.
[1] S. Gales, K. A. Tanaka et al., Rep. Prog. Phys. 81, 094301 (2018)
[2] N. V. Zamfir et al., Romanian Reports in Physics 68, Supplement, S3–S945 (2016)
[3] Y. Yasu et al., J. Phys.: Conf. Ser. 219, 022025 (2010)
[4] S. Aogaki et al., Nucl. Instrum. Methods Phys. Res. A 1056, 168628 (2023)
The ePIC collaboration adopted the JANA2 framework to manage its reconstruction algorithms. This framework has since evolved substantially in response to ePIC's needs. There have been three main design drivers: integrating cleanly with the PODIO-based data models and other layers of the key4hep stack, enabling external configuration of existing components, and supporting timeframe splitting for streaming readout. The result is a unified component model featuring a new declarative interface for specifying inputs, outputs, parameters, services, and resources. This interface enables the user to instantiate, configure, and wire components via an external file. One critical new addition to the component model is a hierarchical decomposition of data boundaries into levels such as Run, Timeframe, PhysicsEvent, and Subevent. Two new component abstractions, Folder and Unfolder, are introduced in order to traverse this hierarchy, e.g. by splitting or merging. The pre-existing components can now operate at different event levels, and JANA2 will automatically construct the corresponding parallel processing topology. This means that a user may write an algorithm once, and configure it at runtime to operate on timeframes or on physics events. Overall, these changes mean that the user requires less knowledge about the framework internals, obtains greater flexibility with configuration, and gains the ability to reuse the existing abstractions in new streaming contexts.
Offline Computing
RNTuple is the new columnar data format designed as the successor to ROOT's TTree format. It makes it possible to exploit modern hardware capabilities and is expected to be used in production by the LHC experiments during the HL-LHC. In this contribution, we discuss the use of Direct I/O to fully exploit modern SSDs, especially in the context of the recent addition of parallel RNTuple writing. In contrast to buffered I/O, where files are accessed via the operating system's page cache, Direct I/O circumvents all caching by the kernel and thereby enables higher bandwidths. To achieve this advantage, however, Direct I/O imposes strict alignment requirements on the I/O requests sent to the operating system: in particular, file offsets, byte counts and userspace buffer addresses must be aligned appropriately. This is challenging for columnar data formats and for RNTuple pages, which have variable size after compression. We will discuss possible strategies and performance results for both synthetic benchmarks and real-world applications.
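The alignment constraints can be illustrated with a small, self-contained sketch (Linux-specific, and not RNTuple code; the 4 KiB block size is an assumption): with O_DIRECT, the file offset, the transfer size and the userspace buffer address all have to be multiples of the block size, which is why variable-size compressed pages require padding or staging strategies.

```python
import os
import mmap

BLOCK = 4096  # assumed logical block size; O_DIRECT requires offset, size and address alignment

# mmap-backed memory is page aligned, which satisfies the buffer-address requirement
buf = mmap.mmap(-1, BLOCK)
buf.write(b"\xab" * BLOCK)

fd = os.open("payload.bin", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
try:
    # Both the offset (0) and the length (BLOCK) are multiples of BLOCK;
    # an unaligned offset, length or buffer would make this call fail with EINVAL.
    os.pwrite(fd, buf, 0)
finally:
    os.close(fd)
    buf.close()
```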
Machine Learning (ML)-based algorithms play increasingly important roles in almost all aspects of data analysis in ATLAS. Diverse ML models are used in detector simulation, event reconstruction, and data analysis, and they are deployed in the ATLAS software framework, Athena. The primary approach to performing ML inference in Athena is to use ONNXRuntime. However, some ML models cannot be converted to ONNX because certain ML operations, such as the MultiAggregation in PyG at the time of writing, are not supported. Furthermore, a scalable inference strategy that maximises the event-processing throughput is needed to cope with the ever-increasing simulation and collision data. A key element of that strategy is enabling these ML algorithms to run on coprocessors such as GPUs, even though not all computing sites have coprocessors available locally. To that end, we introduce AthenaTriton, a tool that runs ML inference as a service based on the NVIDIA Triton Inference Server. With AthenaTriton, we give Athena the capability to act as a Triton client that sends requests to a remote or local server that performs the model inference. We will present the AthenaTriton design and its scalability in running ML-based algorithms. We emphasise that AthenaTriton can be used in both online and offline computing.
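To give a flavour of the inference-as-a-service pattern, the sketch below uses the standard NVIDIA Triton Python client directly; it is not AthenaTriton itself, and the server address, model name and tensor names are placeholders.

```python
import numpy as np
import tritonclient.http as httpclient  # standard Triton Python client

client = httpclient.InferenceServerClient(url="localhost:8000")  # local or remote server

# Hypothetical model with one FP32 input tensor and one FP32 output tensor
batch = np.random.rand(16, 8).astype(np.float32)
inp = httpclient.InferInput("INPUT0", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)

result = client.infer(
    model_name="example_classifier",
    inputs=[inp],
    outputs=[httpclient.InferRequestedOutput("OUTPUT0")],
)
scores = result.as_numpy("OUTPUT0")  # the client code is identical whether the GPU is local or remote
print(scores.shape)
```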
The KM3NeT collaboration is constructing two underwater neutrino detectors in the Mediterranean Sea that share the same technology: the ARCA and ORCA detectors. ARCA is optimized for the observation of astrophysical neutrinos, while ORCA is designed to determine the neutrino mass hierarchy by detecting atmospheric neutrinos. Data from the first deployed detection units are being analyzed, and several physics analyses have already been presented. As the detector configurations grow, and with them the amount of recorded data, efficient data-quality and processing management are essential.
Data reconstruction and Monte Carlo simulations are handled separately for each data-taking period (run) to achieve complete processing output and optimal computing performance. A run-by-run simulation procedure is followed to reproduce the detector conditions, possible seawater environment variations, and the acquisition setup for each run. To address computing requirements such as portability, reproducibility and scalability, the collaboration implemented this approach using Snakemake, a widely used workflow management system.
The High Energy cosmic-Radiation Detection facility (HERD) is a scientific instrument planned for deployment on the Chinese Space Station, aimed at indirectly detecting dark matter and conducting gamma-ray astronomical research. The HERD Offline Software (HERDOS) is developed for HERD offline data processing, including Monte Carlo simulation, calibration, reconstruction and physics analysis tasks. HERDOS is based on SNiPER, a lightweight framework designed for HEP experiments, as well as on state-of-the-art software packages from the HEP community, such as the Detector Description Toolkit (DD4hep), the plain-old-data I/O (podio) and Intel Threading Building Blocks (TBB).
This contribution will provide an overview of the design and implementation of HERDOS; in particular, the following aspects will be addressed:
1. The design of the Event Data Model (EDM) based on Podio, and the implementation of data management system (DMS) through the integration of Podio and SNiPER.
2. The parallelized DMS based on SNiPER and TBB, specifically the development of GlobalStore based on the Podio to enable concurrent data access and data I/O.
3. The parallelized detector simulation based on MT-SNiPER, including both event-level and track-level parallelism.
4. The geometry management system based on DD4hep that provides consistent detector description, an easy-to-use interface to retrieve detector description information.
At present, HERDOS is operating effectively to support the design of the detector, as well as the exploration of its physics potential.
Run 4 of the LHC will yield an unprecedented volume of data. In order to process this data, the ATLAS collaboration is evolving its offline software to be able to use heterogeneous resources such as GPUs and FPGAs. To reduce conversion overheads, the event data model (EDM) should be compatible with the requirements of these resources. While the ATLAS EDM has long allowed representing data as a structure of arrays, further evolution of the EDM can enable more efficient sharing of data between CPU and GPU resources. Some of this work will be summarized here, including extensions to allow controlling how memory for event data is allocated and the implementation of jagged vectors.
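As a language-agnostic illustration of the structure-of-arrays and jagged-vector ideas discussed above (shown here with the Python awkward-array library rather than the ATLAS C++ EDM, and with made-up values), variable-length per-event collections are stored as flat value columns plus offsets, a layout that maps naturally onto contiguous device memory:

```python
import awkward as ak

# A jagged collection: a variable number of track pT values per event,
# stored internally as one flat array of values plus per-event offsets.
track_pt = ak.Array([[25.3, 11.2, 7.8], [42.0], [], [13.5, 9.1]])

# Columnar operations act on the flat content without an explicit event loop
high_pt = track_pt[track_pt > 10.0]
print(ak.num(high_pt))       # selected tracks per event: [2, 1, 0, 1]
print(ak.flatten(high_pt))   # flat view of the selected values
```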
After two successful physics runs, the LHCb experiment underwent a comprehensive upgrade enabling it to run at five times the instantaneous luminosity for Run 3 of the LHC. With this upgrade, LHCb is now the largest producer of data at the LHC. A new offline dataflow was developed to facilitate fast time-to-insight whilst respecting constraints from disk and CPU resources. The Sprucing is an offline data-processing step that further refines the selections and persistency of physics channels coming out of the trigger system. In addition, the Sprucing splits the data into multiple streams, which are written in a format that facilitates more efficient compression. Analysis Productions then provide LHCb analysts with a declarative approach to tupling this data, efficiently exploiting WLCG resources in a centralised way.
The Sprucing and Analysis Productions offline chain provides analysts with their customised tuples within days of the data being taken by the LHCb experiment.
This talk will present the development of this offline data processing chain with a focus on performance results gathered during operations in 2024.
Offline Computing
Tracking charged particles in high-energy physics experiments is a computationally intensive task. With the advent of the High Luminosity LHC era, which is expected to significantly increase the number of proton-proton interactions per beam collision, the amount of data to be analysed will increase dramatically. As a consequence, local pattern recognition algorithms suffer from scaling problems.
In this work, we investigate the possibility of using machine learning techniques in combination with quantum computing. In particular, we represent particle trajectories as graph data structures and train a quantum graph neural network to perform global pattern recognition. We show recent results on the application of this method, with scalability tests for increasing pileup values. We discuss the critical points and give an outlook on potential improvements and alternative approaches.
We also provide insights into various aspects of code development in different quantum programming frameworks such as Pennylane and IBM Qiskit.
With the future high-luminosity LHC era fast approaching, high-energy physics faces large computational challenges for event reconstruction. Employing the LHCb vertex locator as our case study, we are investigating a new approach to charged-particle track reconstruction. The algorithm hinges on minimizing an Ising-like Hamiltonian using matrix inversion. Performing this matrix inversion classically achieves reconstruction efficiency akin to the current state-of-the-art algorithms but is hindered by worse time complexity. Exploiting the Harrow-Hassidim-Lloyd (HHL) quantum algorithm for linear systems holds the promise of an exponential speedup in the number of input hits over its classical counterpart, contingent on two conditions: efficient quantum phase estimation (QPE) and a practical way to read out the algorithm's output. This contribution builds on previous work (DOI 10.1088/1748-0221/18/11/P11028), strives to fulfil these conditions, and streamlines the proposed algorithm's circuit depth by a factor of up to $10^4$. We propose a modified version of the HHL algorithm that restricts the QPE precision to two bits, enabling a novel post-processing algorithm which first estimates the event Primary Vertices (PVs) and then efficiently computes all event tracks through an Adaptive Hough Transform. This alteration significantly reduces the circuit depth and addresses HHL's readout issue, bringing the reconstruction of small events closer to current hardware implementations. The findings presented here aim to further illuminate the potential of harnessing quantum computing for the future of particle-track reconstruction in high-energy physics.
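As a purely classical toy illustration of the "minimise an Ising-like Hamiltonian via a linear system" idea (a sketch under invented compatibility weights, not the authors' actual formulation or the HHL circuit): minimising a quadratic energy E(x) = ½ xᵀA x − bᵀx over relaxed, continuous variables leads to the linear system A x = b, whose solution can then be thresholded into a binary segment selection.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 6                                # number of candidate track segments (toy size)
# Stand-in for a symmetric positive-definite "compatibility" matrix: in a real
# formulation the off-diagonal terms would encode geometric consistency of
# segment pairs; here they are random, with the diagonal acting as a regulariser.
A = rng.normal(size=(n, n))
A = A @ A.T + n * np.eye(n)
b = rng.uniform(0.5, 1.5, size=n)    # bias towards activating segments

# Stationary point of E(x) = 0.5 x^T A x - b^T x  =>  A x = b
x = np.linalg.solve(A, b)

# Threshold the relaxed solution to obtain a binary segment selection
selected = x > x.mean()
print(np.round(x, 3), selected)
```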
The Super Tau Charm Facility (STCF) is a future electron-positron collider proposed with a center-of-mass energy ranging from 2 to 7 GeV and a peak luminosity of 0.5$\times10^{35}$ ${\rm cm}^{-2}{\rm s}^{-1}$. In STCF, the identification of high-momentum hadrons is critical for various physics studies, therefore two Cherenkov detectors (RICH and DTOF) are designed to boost the PID performance.
In this work, targeting pion/kaon identification at STCF, we developed a PID algorithm for the DTOF detector based on a convolutional neural network (CNN), which combines the hit channel and arrival time of Cherenkov photons at multi-anode microchannel-plate photomultipliers. The current performance meets the physics requirements of STCF, with a pion identification efficiency exceeding 97% and a kaon misidentification rate of less than 2% at p = 2 GeV/c. In addition, based on the classical CNN, we conducted a proof-of-concept study of quantum convolutional neural networks (QCNN) to explore potential quantum advantages and feasibility. Preliminary results indicate that the QCNN has promising potential to outperform the classical CNN on the same dataset.
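A minimal sketch of the kind of classifier described, assuming the Cherenkov hits are binned into a 2D (channel × arrival-time) occupancy image; the architecture, input shape and class labels are illustrative, not the STCF implementation.

```python
import torch
import torch.nn as nn

class PidCNN(nn.Module):
    """Binary pion/kaon classifier over a (channel, time) occupancy image."""
    def __init__(self, n_channels=32, n_time_bins=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * (n_channels // 4) * (n_time_bins // 4), 64),
            nn.ReLU(),
            nn.Linear(64, 2),                # logits for (pion, kaon)
        )

    def forward(self, x):
        return self.head(self.features(x))

model = PidCNN()
fake_batch = torch.rand(4, 1, 32, 64)        # batch of 4 toy hit images
print(model(fake_batch).shape)               # torch.Size([4, 2])
```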
Noisy intermediate-scale quantum (NISQ) computers, while limited by imperfections and small scale, hold promise for near-term quantum advantages in nuclear and high-energy physics (NHEP) when coupled with co-designed quantum algorithms and special-purpose quantum processing units.
Developing co-design approaches is essential for near-term usability, but inherent challenges exist due to the fundamental properties of NISQ algorithms.
In this contribution we therefore investigate the core algorithms that can solve optimisation problems via the abstraction layer of a quadratic Ising model or, equivalently, quadratic unconstrained binary optimisation (QUBO) problems, namely quantum annealing (QA) and the quantum approximate optimisation algorithm (QAOA).
Applications in NHEP utilising QUBO formulations range from particle track reconstruction, through job scheduling on computing clusters, to experimental control.
While QA and QAOA do not inherently imply quantum advantage, QA runtime for specific problems can be determined based on the physical properties of the underlying Hamiltonian, albeit it is a computationally hard problem itself.
Our primary focus is on two key areas:
Firstly, we estimate runtimes and scalability for common NHEP problems addressed via QUBO formulations by identifying minimum energy solutions of intermediate Hamiltonian operators encountered during the annealing process.
Secondly, we investigate how the classical parameter space in the QAOA, together with approximation techniques such as a Fourier-analysis based heuristic, proposed by Zhou et al. (2018), can help to achieve (future) quantum advantage, considering a trade-off between computational complexity and solution quality.
Our computational analysis of seminal optimisation problems suggests that only lower frequency components in the parameter space are of significance for deriving reasonable annealing schedules, indicating that heuristics can offer improvements in resource requirements, while still yielding near-optimal results.
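To make the QUBO abstraction concrete, here is a toy brute-force solver (for illustration only; the coefficients are invented, and real NHEP instances are far too large for enumeration, which is precisely why QA and QAOA are of interest):

```python
import itertools
import numpy as np

# A QUBO instance: minimise x^T Q x over binary x, here a 4-variable toy problem.
# Negative diagonal terms reward setting a variable, positive couplings penalise
# activating certain pairs together.
Q = np.array([
    [-1.0,  2.0,  0.0,  0.0],
    [ 0.0, -1.0,  2.0,  0.0],
    [ 0.0,  0.0, -1.0,  2.0],
    [ 0.0,  0.0,  0.0, -1.0],
])

best_x, best_e = None, np.inf
for bits in itertools.product([0, 1], repeat=Q.shape[0]):
    x = np.array(bits)
    energy = x @ Q @ x        # QUBO cost; maps onto an Ising Hamiltonian by x = (1 - s) / 2
    if energy < best_e:
        best_x, best_e = x, energy

print(best_x, best_e)
```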
Quantum computing can empower machine learning models by enabling kernel machines to leverage quantum kernels for representing similarity measures between data. Quantum kernels are able to capture relationships in the data that are not efficiently computable on classical devices. However, there is no straightforward method to engineer the optimal quantum kernel for each specific use case. While recent literature has focused on exploiting the potential offered by the presence of symmetries in the data to guide the construction of quantum kernels, we adopt here a different approach, which employs optimization techniques, similar to those used in neural architecture search and AutoML, to automatically find an optimal kernel in a heuristic manner. The algorithm we present constructs a quantum circuit implementing the similarity measure as a combinatorial object, which is evaluated based on a cost function and is then iteratively modified using a meta-heuristic optimization technique. The cost function can encode many criteria ensuring favorable statistical properties of the candidate solution, such as the rank of the Dynamical Lie Algebra. Importantly, our approach is independent of the optimization technique employed. The results obtained by testing our approach on a high-energy physics problem demonstrate that, in the best-case scenario, we can either match or improve testing accuracy with respect to the manual design approach, showing the potential of our technique to deliver superior results with reduced effort.
Simulation and analysis tools
At the LHC experiments, RNTuple is emerging as the primary data storage solution, and will be ready for production next year. In this context, we introduce the latest development in UnROOT.jl, a high-performance and thread-safe Julia ROOT I/O package that facilitates both the reading and writing of RNTuple data.
We briefly share insights gained from implementing RNTuple Reader twice: first in Python, and then in Julia. We discuss the composability of the RNTuple type system and demonstrate how Julia's multiple dispatch feature has been effectively employed to realize this concisely.
Regarding the implementation of RNTuple Writer, we outline the current capabilities and illustrate how they support end-user analyses. Furthermore, we present a roadmap for future development aimed at achieving seamless data I/O interoperability across various programming languages and libraries, including C++, Python, and Julia.
Lastly, we showcase the capabilities and performance of our Julia implementation with real examples. We highlight how our solution facilitates interactive analysis for end-users utilizing RNTuple.
The Fair Universe project is organising the HiggsML Uncertainty Challenge, which runs from June to October 2024.
This HEP and Machine Learning competition is the first to strongly emphasise uncertainties: mastering uncertainties in the input training dataset and outputting credible confidence intervals.
The context is the measurement of the Higgs to tau+ tau- cross section, as in the HiggsML challenge on Kaggle in 2014, from a dataset of final-state 4-momenta. Participants are asked to design an advanced analysis technique that can not only measure the signal strength but also provide a confidence interval, whose coverage will be evaluated automatically from pseudo-experiments.
The confidence interval should include statistical and systematic uncertainties (concerning detector calibration, background levels, etc…). It is expected that advanced analysis techniques that can control the impact of systematics will perform best, thereby pushing the field of uncertainty-aware AI techniques for HEP and beyond.
The challenge is hosted on Codabench (an evolution of the popular Codalab platform); the significant resources needed (to run the thousands of pseudo-experiments needed) are possible thanks to using NERSC infrastructure as a backend.
The competition will have ended just before CHEP 2024, so a first glimpse of the results can be made public at the conference.
The high luminosity LHC (HL-LHC) era will deliver unprecedented luminosity and new detector capabilities for LHC experiments, leading to significant computing challenges with storing, processing, and analyzing the data. The development of small, analysis-ready storage formats like CMS NanoAOD (4 kB/event), suitable for up to half of physics searches and measurements, helps achieve the necessary reductions in data processing and storage. However, a large fraction of analyses frequently require very computationally expensive machine-learning output or data only stored in larger and less accessible formats, such as CMS MiniAOD (45 kB/event) or AOD (450 kB/event). This necessitates the non-volatile storage of derived data in custom formats. In this work, we present research on the development of workflows and the integration of tools with ServiceX to efficiently fetch, cache, and join together data for use with columnar analysis tools.
We leverage scalable, distributed SQL query engines like Trino to join disparate columns sourced from multiple files, without restrictions on relative row ordering. By replacing the many customized datasets, which contain largely overlapping contents, with smaller and unique sets of information that can be joined on demand with common central data, duplication can be reduced. Caching these results keeps the cost of subsequent retrieval low, fitting well with modern physics analysis paradigms.
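A sketch of the join-on-demand idea using the standard Trino Python client (the host, catalog, table and column names are hypothetical, and this is not the ServiceX integration itself): a small table of derived quantities is joined onto common central data by event identifiers at query time, rather than being duplicated into a custom dataset.

```python
from trino.dbapi import connect  # standard Trino Python client

# Hypothetical connection and schema; in practice these would point at the
# central event data and at a small, analysis-specific table of derived columns.
conn = connect(host="trino.example.org", port=8080, user="analyst",
               catalog="hive", schema="analysis")
cur = conn.cursor()

# Join derived ML scores onto central columns by (run, event),
# with no assumption on relative row ordering in the underlying files.
cur.execute("""
    SELECT c.run, c.event, c.muon_pt, d.ml_score
    FROM central_nanoaod AS c
    JOIN derived_scores  AS d
      ON c.run = d.run AND c.event = d.event
    WHERE d.ml_score > 0.9
""")
for row in cur.fetchmany(5):
    print(row)
```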
The software toolbox used for "big data" analysis has been changing rapidly in the last few years. The adoption of software design approaches able to exploit the new hardware architectures and improve code expressiveness plays a pivotal role in boosting data-processing speed, resource optimisation, analysis portability and analysis preservation.
The scientific collaborations in the field of High Energy Physics (e.g. the LHC experiments, the next-generation neutrino experiments, and many more) are devoting increasing resources to the development and implementation of bleeding-edge software technologies in order to cope effectively with ever-growing data samples, pushing the reach of the single experiment and of the whole HEP community.
The introduction of declarative paradigms in the analysis description and implementation is gaining interest and support in the main collaborations. This approach can simplify and speed up the analysis-description phase, support the portability of analyses among different datasets and experiments, and strengthen the preservation and reproducibility of the results.
Furthermore, by providing a deep decoupling between the analysis algorithm and the back-end implementation, this approach is a key element for present and future processing speed, potentially even with back-ends that do not exist today.
In the landscape of the approaches currently under study, an activity is ongoing in the ICSC (Centro Nazionale di Ricerca in HPC, Big Data and Quantum Computing, Italy) which focuses on the development of a framework characterised by a declarative paradigm for the analysis description and able to operate on datasets from different experiments.
The existing NAIL (Natural Analysis Implementation Language [1]) Python package, developed in the context of the CMS data analysis for the event processing, is used as a building base for the development of a demonstrator able to provide a general and effective interface characterised by a declarative paradigm and targeted to the description and implementation of a full analysis chain for HEP data, with support for different data formats.
Status and development plan of the demonstrator will be discussed.
[1] https://indico.cern.ch/event/769263/contributions/3413006/attachments/1840145/3016759/NAIL_Project_Natural_Analysis_Implementation_Language_1.pdf
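For readers unfamiliar with the declarative style referred to above, the sketch below uses ROOT's RDataFrame, an existing declarative interface, purely to illustrate the paradigm; it is not NAIL nor the ICSC demonstrator, and the file, tree and column names are hypothetical. The analyst declares what to compute, and the back-end decides how and where to run it:

```python
import ROOT

# Declarative chain: no explicit event loop; the framework schedules (and can
# parallelise) the computation only when a result is actually requested.
df = ROOT.RDataFrame("Events", "input.root")

h = (df.Filter("nMuon >= 2", "at least two muons")
       .Define("lead_mu_pt", "Muon_pt[0]")
       .Histo1D(("lead_mu_pt", ";p_{T} [GeV];events", 50, 0.0, 200.0), "lead_mu_pt"))

h.Draw()  # the event loop runs here, when the histogram is materialised
```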
The ATLAS experiment is in the process of developing a columnar analysis demonstrator, which takes advantage of the Python ecosystem of data science tools. This project is inspired by the analysis demonstrator from IRIS-HEP.
The demonstrator employs ATLAS PHYSLITE Open Data, the new compact Run 3 ATLAS analysis data format. The tight integration of ROOT features within PHYSLITE presents unique challenges when integrating with the Python analysis ecosystem. Building the demonstrator on PHYSLITE Open Data ensures the accessibility and reproducibility of the analysis.
The analysis pipeline of the demonstrator incorporates a comprehensive suite of tools and libraries. These include uproot for data reading, awkward-array for data manipulation, Dask for parallel computing, and hist for histogram processing. For statistical analysis, the pipeline integrates cabinetry and pyhf, providing a robust toolkit for analysis. A significant component of this project is the custom application of corrections, scale factors, and systematic uncertainties using ATLAS software; for this component we therefore conduct a comparative analysis of event-processing throughput across both the event-loop and columnar analysis environments. The infrastructure and methodology for these applications will be discussed in detail during the presentation, underscoring the adaptability of the Python ecosystem for high-energy physics analysis.
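A condensed sketch of the kind of columnar pipeline described (uproot for reading, awkward-array for selection, hist for histogramming; the Dask and statistics stages are omitted, and the file, tree and branch names are placeholders rather than actual PHYSLITE content):

```python
import uproot
import awkward as ak
import hist

# Read only the branches needed, as jagged awkward arrays
arrays = uproot.open("physlite_sample.root")["AnalysisTree"].arrays(["el_pt", "el_eta"])

# Columnar selection: electrons with pT > 25 GeV within tracker acceptance
mask = (arrays["el_pt"] > 25.0) & (abs(arrays["el_eta"]) < 2.47)
selected_pt = arrays["el_pt"][mask]

# Fill a histogram from the flattened selected column
h = hist.Hist.new.Reg(50, 0, 250, name="pt", label="electron pT [GeV]").Double()
h.fill(pt=ak.flatten(selected_pt))
print(h.sum())
```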
Over the past few decades, there has been a noticeable surge in muon tomography research, also referred to as muography. This method, falling under the umbrella of Non-Destructive Evaluation (NDE), constructs a three-dimensional image of a target object by harnessing the interaction between cosmic ray muons and matter, akin to how radiography utilizes X-rays. Essentially, muography entails scanning a target object by analyzing its interaction with muons, with the interaction mode contingent upon the energy of the incident muon and the characteristics of the medium involved. As cosmic muons interact electromagnetically with atoms within the target medium, their trajectories are likely to deviate prior to reaching the position sensitive detectors placed at suitable locations around the object under study. These deviations serve as a rich source of data that can be used to generate images and infer the material composition of the target.
In this study, a numerical simulation has been conducted using the GEANT4 framework to assess the efficacy of various position sensitive charged particle detectors in muography. The feasibility of detectors with a broad range of position resolutions has been tested, particularly in the context of developing an imaging algorithm to monitor drums containing nuclear waste. The Cosmic Ray Shower Library (CRY) has been employed to simulate muon showers on the detector-target system. The reconstruction of muon tracks, crucial for analyzing muon scattering, has been achieved by collecting hits from all detector layers. Incoming muon tracks have been reconstructed using hits from the upper set of detectors, while outgoing muon tracks have been reconstructed using hits from the lower set.
In this presentation, the discussion will center on track reconstruction algorithms, emphasizing the use of efficient single scattering point algorithms like Point of Closest Approach (PoCA) for simplified implementation and fast computation. To enhance material discrimination confidence, a Support Vector Machine (SVM) based algorithm has been applied, utilizing features such as scattering vertices density ($\rho_c$) and average deviation angle ($\theta_{avg}$) as inputs. SVM hyperplanes have been generated to segregate various material classes, and corresponding confusion matrices have been obtained. Additionally, for analyzing the shape of materials within nuclear waste drums, an algorithm based on the Pattern Recognition Method (PRM) has been employed. This presentation will delve into studies of track reconstruction algorithms applied to GEANT4 data for particle detectors with varying position resolutions, followed by shape and image analysis based on the PRM, with the motivation of optimizing storage of nuclear waste that can be efficiently monitored by techniques such as muography.
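The Point of Closest Approach computation mentioned above can be stated compactly; the sketch below (NumPy, with made-up track parameters) returns the midpoint of the shortest segment between the incoming and outgoing straight-line muon tracks, which serves as the estimated scattering vertex:

```python
import numpy as np

def poca(p1, d1, p2, d2):
    """Midpoint of the shortest segment between lines x = p1 + s*d1 and x = p2 + t*d2."""
    p1, d1, p2, d2 = (np.asarray(v, dtype=float) for v in (p1, d1, p2, d2))
    w0 = p1 - p2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b              # vanishes only for parallel tracks
    s = (b * e - c * d) / denom        # parameter of the closest point on track 1
    t = (a * e - b * d) / denom        # parameter of the closest point on track 2
    return 0.5 * ((p1 + s * d1) + (p2 + t * d2))

# Toy incoming/outgoing muon tracks fitted from the upper and lower detector layers
vertex = poca(p1=[0.0, 0.0, 100.0], d1=[0.01, 0.00, -1.0],
              p2=[1.5, 0.2, -100.0], d2=[0.03, 0.01, 1.0])
print(vertex)   # estimated scattering point inside the inspected volume
```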
Simulation and analysis tools
The Jiangmen Underground Neutrino Observatory (JUNO) is a neutrino experiment under construction in the Guangdong province of China. The experiment has a wide physics program with the most ambitious goal being the determination of the neutrino mass ordering and the high-precision measurement of neutrino oscillation properties using anti-neutrinos produced in the 50 km distant commercial nuclear reactors of Taishan and Yangjiang.
To reach its aims, the detector features an acrylic sphere of 35.4 meters in diameter filled with 20 kt of liquid scintillator and equipped with 17,612 20-inch photomultiplier tubes (PMTs) and 25,600 3-inch PMTs to provide an energy resolution better than 3% at 1 MeV. In addition to the cutting-edge features and performance of the detector, a critical aspect for achieving the physics goals is a deep understanding of such a complicated detector. In this respect, an accurate Monte Carlo (MC) simulation of the detector and the interactions happening inside of it is crucial. The simulation depends on many effective parameters, which must be tuned to accurately describe the data acquired.
In this contribution, we propose a novel machine-learning approach to MC tuning that combines Generative Learning and data acquired during calibration campaigns. We study Generative Adversarial Networks (GAN) as a way to speed up event simulation and as an efficient model to interpolate within the parameter space. We consider three main parameters related to the energy response of the JUNO detector and optimize their value in the MC by comparing calibration data to the GAN simulations. Parameter estimation is performed via Bayesian optimization based on a Nested Sampling algorithm to cope with the wide and complex parameter space.
The presented approach is easily scalable to include more parameters and is general enough to be employed in most modern physics experiments.
Built on algorithmic differentiation (AD) techniques, differentiable programming allows derivatives of computer programs to be evaluated. Such derivatives are useful across domains for gradient-based design optimization and parameter fitting, among other applications. In high-energy physics, AD is frequently used in machine-learning model training and in statistical inference tasks such as maximum likelihood estimation. Recently, AD has begun to be explored for the end-to-end optimization of particle detectors, with potential applications ranging from HEP to medical physics to astrophysics. To that end, the ability to estimate derivatives of the Geant4 simulator for the passage of particles through matter would be a huge step forward.
The complexity of Geant4, its programmatic control flow, and its underlying stochastic sampling processes, introduce challenges that cannot all be addressed by current AD tools. As such, the application of current AD tools to Geant4-like simulations can provide invaluable insights into the accuracy and errors of the AD gradient estimates and into how to address remaining challenges.
In this spirit, we have applied the operator-overloading AD tool CoDiPack to the compact G4HepEm/HepEmShow package for the simulation of electromagnetic showers in a simple sampling calorimeter. Our AD-enabled simulator allows us to estimate derivatives of energy depositions with respect to properties of the geometry and of the incoming particles. The derivative estimator comes with a small bias, which however proved unproblematic in a simple optimization study. In this talk, we will report on our methodology and encouraging results, and demonstrate how a next-generation AD tool, Derivgrind, can be used to bring these results to the scale of Geant4.
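To illustrate the operator-overloading principle behind tools like CoDiPack and Derivgrind (a minimal forward-mode sketch in Python, not the actual C++ tools; the "energy deposition" function is invented): each arithmetic operation propagates a derivative alongside its value, so the derivative of a whole program emerges from the chain rule applied operation by operation.

```python
import math

class Dual:
    """Forward-mode AD value: carries f(x) and df/dx through overloaded operators."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def exp(x):
    # Chain rule for the exponential: d/dx exp(u) = exp(u) * du/dx
    return Dual(math.exp(x.value), math.exp(x.value) * x.deriv)

# Hypothetical "energy deposition" as a function of an absorber thickness t
def deposit(t):
    return 3.0 * t * t + exp(-1.0 * t)

t = Dual(2.0, 1.0)              # seed dt/dt = 1
out = deposit(t)
print(out.value, out.deriv)     # derivative equals 6*t - exp(-t) evaluated at t = 2
```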
The ATLAS experiment at the LHC heavily depends on simulated event samples produced by a full Geant4 detector simulation. This Monte Carlo (MC) simulation based on Geant4 is a major consumer of computing resources and is anticipated to remain one of the dominant resource users in the HL-LHC era. ATLAS has continuously been working to improve the computational performance of this simulation for the Run 3 MC campaign. This update highlights the implementation of recent and upcoming optimizations. These improvements include enhancements to the core Geant4 software, strategic choices in simulation configuration, simplifications in geometry and magnetic field descriptions, and technical refinements in the interface between ATLAS simulation code and Geant4. Overall, these improvements have resulted in a more than 100% increase in throughput compared to the baseline simulation configuration utilized during Run 2.
For the start of Run 3, the CMS Full Simulation was based on Geant4 10.7.2. In this work we report on the evolution of Geant4 usage within CMSSW and the adaptation of the newest release, Geant4 11.2.1, which is expected to be used for CMS simulation production in 2025. Physics validation results and CPU performance results are reported.
For the Phase-2 simulation, several R&D efforts are being carried out. A significant update of the CMS geometry description has been performed using the DD4hep and VecGeom tools; the modifications of the CMS geometry concern a new tracker, a new timing detector, an extended muon system, and a new endcap high-granularity calorimeter. Different aspects of the geometry description and physics simulation for the new detectors will be discussed. Progress on R&D efforts for the Phase-2 simulation will be presented, including reports on experience with the G4HepEm, Celeritas, and AdePT external libraries.
The Compressed Baryonic Matter (CBM) experiment is an under-construction heavy-ion physics experiment for exploring the QCD phase diagram at high $\mu_{B}$, which will use the new SIS-100 accelerator at the Facility for Antiproton and Ion Research (FAIR) in Darmstadt, Germany. The Silicon Tracking System (STS) will be the main detector for tracking and momentum determination. A scaled-down prototype of various detector systems, including the mini STS (mSTS), is undergoing meticulous testing in the mini CBM (mCBM) experiment at the existing SIS-18 accelerator at GSI, Helmholtzzentrum für Schwerionenforschung, in Darmstadt. This initiative seeks to comprehensively assess both hardware and software components, ensuring their efficacy in the online capture, processing and analysis of the intricate topological data generated by real events detected by the detector sub-systems.
In recent years, much effort has been put into a better and more accurate description of the detector geometries to better model the background. The direct conversion of Computer-Aided Design (CAD) based geometry models to the Geometry Description Markup Language (GDML), an XML-based format, using different software toolkits has attracted considerable attention. The solids extracted from CAD models and represented in GDML format typically consist of triangular or quadrilateral facets. The $\texttt{TGDMLParser}$ functionality in ROOT and $\texttt{G4GDMLParser}$ in Geant4 facilitate the reading of the different volumes from the GDML file and the creation of volume assemblies. However, this approach leads to an increase in simulation run-time.
We will present a comparative analysis of simulation studies with two distinct representations of the mSTS geometry: one employing simplified primitive ROOT/TGeo solids and the other utilizing tessellated-solid-based geometry, covering secondary-particle production, the significance of passive volumes and computation time, as well as a comparison of simulation data with real data measured in Ni-Ni collisions at 1.93 AGeV.
The IceCube Neutrino Observatory instruments one cubic kilometer of glacial ice at the geographic South Pole. Cherenkov light emitted by charged particles is detected by 5160 photomultiplier tubes embedded in the ice. Deep Antarctic ice is extremely transparent, resulting in absorption lengths exceeding 100 m. However, yearly variations in snow deposition rates on the glacier over the last 100 thousand years have created roughly horizontal layers which vary significantly in scattering and absorption coefficients. These variations must be taken into account when simulating IceCube events. In addition, anisotropies in photon propagation have been observed and recently described by deflection by birefringent polycrystals. Modeling of the ice properties remains one of the largest sources of systematic uncertainty in IceCube analyses, requiring intensive studies of the ice. Although photon tracking is highly parallelizable and an ideal case for GPUs, the limiting constraint for these studies is the time spent simulating photon propagation. In order to perform these simulations efficiently and accurately, custom software has been developed and optimized for our specific use case. IceCube's current production simulation code, CLSim, is based on OpenCL, is tightly coupled to IceCube's simulation stack, and is in need of modernization. This talk will discuss the current requirements for photon-tracking code in IceCube and the effort to transition the code to a new C++ framework which uses std::par.
Collaborative software and maintainability
The ATLAS offline code management system serves as a collaborative framework for developing a code base totaling more than 5 million lines. Supporting up to 50 nightly release branches, the ATLAS Nightly System offers abundant opportunities for updating existing software and developing new tools for forthcoming experimental stages within a multi-platform environment. This paper describes the utilization of container technology for the ATLAS nightly jobs. By conducting builds and tests of offline releases within containers, we ensure portability across various build nodes. The controlled container environment enhances stability by removing dependencies on operating-system updates. Furthermore, it lays the groundwork for, and facilitates, the production of containerized software across different user activity areas and pipelines. The ATLAS experiment has accumulated data since 2009, and it is important to maintain access to software, developed on now-outdated operating systems, for processing and analyzing historical data. Container technology plays an indispensable role in providing secure and operationally sound environments for building and testing on such operating systems. This document provides details on the organizational support for OS containers used in software building, including methods for setting up runtime environments.
The ATLAS experiment will undergo major upgrades for operation at the high-luminosity LHC. The high pile-up interaction environment (up to 200 interactions per bunch crossing at 40 MHz) requires a new radiation-hard tracking detector with a fast readout.
The scale of the proposed Inner Tracker (ITk) upgrade is much larger than the current ATLAS tracker: the current tracker consists of ~4000 modules, while the ITk will be made of ~28,000 modules. To ensure good production quality, all the items used to build modules, as well as the larger structures on which they will be placed, need to be tracked along with the relevant quality-control and quality-assurance information. Hence, the ITk production database (PDB) is vital for following the complex production flow for each item across institutes around the globe. The database also allows close monitoring of production quality and production speed. After production, the information will be stored for the 10 years of data-taking in order to trace potential operational issues back to specific production items.
A PDB API allows the development of tools for database interaction by different user types: technicians, academics, engineers and vendors. Several options have been pursued to meet the collaboration's needs: a Pythonic API wrapper, data-acquisition GUIs with integrated scripts, command-line scripts distributed via git repositories, containerised applications, and CERN-hosted resources.
This presentation promotes information exchange and collaboration on tools which support detector construction in a large-scale experiment. Examples of front-end development and reporting will be shown. Through these examples, the general themes of large-scale data management and multi-user global accessibility will be discussed. These concepts are relevant not only for modern high-energy particle physics (HEP) but also for large experiments beyond HEP.
XRootD is a robust, scalable service that supports globally distributed data management for diverse scientific communities. Within GridPP in the UK, XRootD is used by the Astronomy, High-Energy Physics (HEP) and other communities to access more than 100 PB of storage. The optimal configuration for XRootD varies significantly across sites due to unique technological frameworks and site-specific factors.
XRootD's adaptability has made it a cornerstone of the national data-management strategy for GridPP. Given its high-profile role, new releases and features of XRootD undergo rigorous testing and verification before national deployment. Historically, this process involved manual integration testing and dedicated test deployments, which required substantial input from both local site administrators and remote support teams. This approach has placed considerable demands on support staff, requiring extensive technical expertise and significant time for verification.
To support the storage community within GridPP, we have developed a system, "XKIT", that automates the deployment of a virtual grid using Kubernetes for XRootD testing. Using a container-based approach, this system enables high-level integration tests to be performed automatically and reproducibly. This not only simplifies the support process but also significantly reduces the time staff need to dedicate to repetitive testing for new deployments.
We have identified >20 unique XRootD configurations necessary for XKIT. By deploying each of these setups on our platform, we aim to provide the GridPP community with a consistent suite of functional tests tailored to various site topologies.
This presentation will explore the development of the XKIT platform, discuss the challenges we encountered, and highlight the advantages this system offers to GridPP and the wider community.
For over two decades, the dCache project has provided open-source storage software to satisfy ever more demanding storage requirements. More than 80 sites around the world rely on dCache to provide services for the LHC experiments, Belle II, EuXFEL and many others. This can be achieved only with a well-established process from the whiteboard, where ideas are created, through development, packaging and testing. The project's build and test infrastructure is based on Jenkins CI and a set of virtual machines, maintained by the dCache developers. With the introduction of the central DESY GitLab server, the developers have started migrating from VM-based testing to container-based deployments in the on-site Kubernetes cluster. As a result, we have packaged dCache containers and Helm charts that can be used by other sites to quickly reproduce our test and build steps or to evaluate new releases on their pre-production systems and, eventually, become a standard model of dCache deployment at the sites.
This presentation will show the challenges we have faced, the techniques used to solve them, and the issues that still need to be addressed.
Nearly none of the models from partial wave analysis can be reproduced based on published papers due to omitted nuances and implementation details. This issue affects progress and reliability in high-energy physics. Our project addresses this by standardizing the serialization of amplitude models into a lightweight, human-readable format, starting with three-body decay analyses. This standardization ensures accurate model verification, addressing common issues found even in published research.
We have developed a centralized repository containing a collection of models from LHCb and COMPASS analyses using ThreeBodyDecays.jl, ComPWA, and TFAnalysis. The serialized models facilitate community reanalysis, enable the use of models in MC generators, and provide data for testing and benchmarking new frameworks. We employ the Pixi setup for reproducible package management across platforms and Quarto for multilanguage support. Julia and Python notebooks run different frameworks, enhancing analysis and visualization.
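As an illustration of how such serialized models can be consumed programmatically, the sketch below loads a model description from a JSON file; the file name, keys and layout here are hypothetical and do not reproduce the project's actual schema.

```python
# Illustrative only: reading a serialized three-body amplitude model.
# The file name and JSON layout below are hypothetical, not the real schema.
import json

with open("Lc2pKpi_model.json") as f:   # hypothetical file name
    model = json.load(f)

print(model["decay"])                    # e.g. parent and final-state particles
for chain in model["chains"]:            # one entry per decay chain / resonance
    print(chain["resonance"], chain["spin"], chain["lineshape"])
```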
As a fresh development, we see potential for broader adoption, including a possible extension to ROOT via the HS3 initiative. Success requires community support and collaboration with framework developers, advancing transparent, reproducible research.
Further details are available on our project webpage.
ROOT is an open source framework, freely available on GitHub, at the heart of data acquisition, processing and analysis of HE(N)P experiments, and beyond.
It is developed collaboratively: contributions are authored not only by ROOT team members, but also by a veritable nebula of developers and scientists from universities, labs, and the private sector. More than 1500 GitHub Pull Requests are merged on average per year. It is in this context that code integration acquires a primary role: not only do code contributions need to be reviewed, they also need to be thoroughly tested through a powerful CI infrastructure on several different platforms to comply with the high code quality standards of the project. At the end of 2023, ROOT moved its continuous integration system from Jenkins to one based on GitHub Actions.
In this contribution, we characterise the transition to the GitHub CI, focusing on our strategy, its implementation and the lessons learned, as well as the advantages the new system offers with respect to the previous one. Particular emphasis will be given to the evaluation of the cost-benefit ratio of Jenkins and GitHub Actions for the ROOT project. We will also describe how we manage to run, in less than one hour, thousands of unit, integration, functional and end-to-end tests on different flavours of Windows, four versions of macOS, and about ten of the most used Linux distributions, taking advantage of the CERN computing infrastructure.
Collaboration, Reinterpretation, Outreach and Education
The CMS experiment at the Large Hadron Collider (LHC) regularly releases open data and simulations, enabling a wide range of physics analyses and studies by the global scientific community. The recent introduction of the NanoAOD data format has provided a more streamlined and efficient approach to data processing, allowing for faster analysis turnaround. However, the larger MiniAOD format retains richer information that may be crucial for certain research endeavors.
To ensure the long-term usability of CMS open data to their full extent, this work explores the potential of leveraging public cloud resources for the computationally intensive processing of the MiniAOD format. Many open data users may not have access to the necessary computing resources for handling the large MiniAOD datasets. By offloading the heavy lifting to scalable cloud infrastructure, researchers can benefit from increased processing power and improved overall efficiency in their data analysis workflows, with a moderate short-term cost.
The study investigates best practices and challenges for effectively utilizing public cloud platforms to handle the processing of CMS MiniAOD data, with a focus on quantifying the overall time and cost of using these resources. The ultimate aim is to empower the CMS open data community to maximize the scientific impact of this valuable resource.
The Large Hadron Collider Beauty (LHCb) experiment offers an excellent environment to study a broad variety of modern physics topics. Its data from the major physics campaigns (Run 1 and 2) at the Large Hadron Collider (LHC) has yielded over 600 scientific publications. In accordance with the CERN Open Data Policy, LHCb announced the release of the full Run 1 dataset gathered from proton-proton collisions, amounting to approximately 800 terabytes. The Run 1 data was released on the CERN Open Data portal in 2023. However, due to the large amount of data collected during Run 2, it is no longer feasible to make the reconstructed data accessible to the public in the same way.
We have, therefore, developed a new approach to publishing Open Data by means of a dedicated LHCb Ntupling Service, which allows third-party users to query the data collected by LHCb and request custom samples in the same columnar data format used by LHCb physicists. These samples are called Ntuples and can be individually customized in the web interface using LHCb standard tools for saving measured or derived quantities of interest. The configuration output is kept in a pure data structure format (YAML) and is interpreted by internal parsers that generate the necessary Python scripts for the LHCb Ntuple production job. In this way, the LHCb Ntupling Service serves as a gateway for third-party users to prepare custom Ntuple jobs, eliminating the need for real-time interaction with the LHCb database and solving potential access control and computer security issues related to opening LHCb internal tools to the public.
The LHCb Ntupling Service was developed as a collaborative effort by LHCb and the CERN Open Data team from the CERN Department of Information Technology. The service consists of a web frontend that allows users to create Ntuple production requests, and a backend application that processes the user requests, stores them in GitLab repositories, offers vetting capabilities to the LHCb Open Data team, and automatically dispatches approved requests to the LHCb Ntuple production systems. The produced Ntuples are then collected and exposed back to the users through the frontend web interface.
This talk is a joint presentation by LHCb and CERN IT and will elaborate on the LHCb Ntupling Service infrastructure as well as its typical use-case scenarios for querying and studying the LHCb open data.
ATLAS Open Data for Education delivers proton-proton collision data from the ATLAS experiment at CERN to the public, along with open-access resources for education and outreach. To date, ATLAS has released a substantial amount of data from 8 TeV and 13 TeV collisions in an easily accessible format, supported by dedicated documentation, software, and tutorials to ensure that everyone can access and exploit the data for different educational objectives. Along with the datasets, ATLAS also provides data visualisation tools and interactive web-based applications for studying the data, as well as Jupyter Notebooks and downloadable code enabling users to further analyse the data for known and unknown physics cases. The Open Data educational platform which hosts the data and tools is used by tens of thousands of students worldwide, and we present the project development, lessons learnt, impacts, and future goals.
High Energy (Nuclear) Physics and Open Source are a perfect match with a long history. CERN has created an Open Source Program Office (CERN OSPO [1]) to support open-source hardware and software in the CERN community - for CERN staff and the experiments' users. In the wider context, open source and CERN's OSPO have key roles in CERN's Open Science Policy [2]. With the OSPO, open-source projects should gain more visibility inside and outside the organization, as contributions to society; the OSPO's team of practitioners wants to make open source at CERN an easier, more obvious task.
This presentation will provide an overview of the mission and objectives of the CERN Open Source Program Office (OSPO). This contribution exposes how the OSPO can and needs to help, what the OSPO wants to achieve, and what an OSPO's role might be in the HE(N)P software ecosystem. After more than a year of active engagement, we will share insights gained so far, including the different challenges of open source in different parts of CERN. The presentation will share some behind-the-scenes stories: what the challenges were in creating it, what makes it special compared to other OSPOs, and why the OSPO won't do some things you might expect it to do. We will present the initial set of technical recommendations ("best practices") proposed by the CERN OSPO; some alignment across institutions might be beneficial for the global HE(N)P community.
By sharing the CERN OSPO’s journey, challenges, and lessons learned, we hope to provide valuable insights relevant to other HE(N)P centers, open-source projects, and the wider open source community.
[1] https://opensource.cern/mandate
[2] https://openscience.cern/policies
The CERN Open Data Portal holds over 5 petabytes of high-energy physics experiment data, serving as a hub for global scientific collaboration. Committed to Open Science principles, the portal aims to democratize access to these datasets for outreach, training, education, and independent research.
Recognizing the limitations of current disk-based storage, we are starting a project to expand our data storage methodologies. Our approach involves integrating hot storage (such as spinning disks) for immediate data access and cold storage (such as tape, or even interfaces to the experiment frameworks) for cost-effective long-term preservation. This strategy will significantly expand the portal's capacity to accommodate more experiment data. However, we anticipate challenges in navigating technical complexities and logistical hurdles. These include the latency of accessing cold data, monitoring and automating the transitions between hot and cold storage, and ensuring the long-term preservation of data in the experiment frameworks. The strategy is to integrate existing solutions such as EOS, FTS, CTA and Rucio.
In our presentation, we will discuss these challenges, present our prototype solution, and outline future developments aimed at enhancing the accessibility, efficiency, and resilience of the CERN Open Data Portal’s data ecosystem.
In recent years, there has been significant political and administrative interest in "Open Science", which on one hand has led to additional obligations but on the other to significant financial backing. For institutes and scientific collaborations, the funding opportunities may have brought some focus on these topics, but there is also the significant hope that engagement in open science infrastructure and culture can have a multiplying effect on scientific output through the sharing of knowledge among and between scientists and citizens.
The Facility for AntiProton and Ion Research in Europe (FAIR) is a particle accelerator just outside Darmstadt in Germany, which is under final construction at a site adjacent to the GSI Helmholtz Centre for Heavy Ion Research. One of its five scientific pillars is the Compressed Baryonic Matter (CBM) experiment, which is now prioritised and expected to receive its first beam in 2028. For CBM, as a leading international scientific collaboration, an active open science policy is an imperative.
In this contribution, we outline our fully formed policy towards "Open Software" and describe how we overcame difficulties to facilitate a F.A.I.R. level of openness. We discuss the internally controversial issue of "Open Data" and the opportunity to test data policies technically at the prototype experiment mini-CBM, before applying them to the more important physics-rich data from our future world-class experiment. Lastly, we discuss what it means to be an "Open Collaboration" and how engagement in open science strategy within the collaboration could facilitate a plethora of new citizen science projects and help progress our research and the open science agenda.
The poster will present the FunRootAna library.
This is a simple framework allowing ROOT analysis to be done in a more functional way. In comparison to ROOT's RDataFrame it offers a more functional feel for the data analysis and can be used in any circumstances, not only with ROOT trees. Collections processing is inspired by Scala and Apache Spark, and histogram creation and filling are much simplified. As a consequence, a single line containing selection, data extraction and histogram definition is sufficient to obtain one unit of result. In effect, with FunRootAna the number of lines of analysis code per histogram converges to one. More here: https://tboldagh.github.io/FunRootAna/
The ATLAS detector produces a wealth of information for each recorded event. Standard calibration and reconstruction procedures reduce this information to physics objects that can be used as input to most analyses; nevertheless, there are very specific analyses that need full information from some of the ATLAS subdetectors, or enhanced calibration and/or reconstruction algorithms. For these use cases, a novel workflow has been developed that involves the selection of events satisfying some basic criteria, their extraction in RAW data format using the EventIndex data catalogue and the Event Picking Server, and their specialised processing. In addition, this workflow allows us to commission and use new calibration and reconstruction techniques before launching the next full reprocessing (important given the ever longer expected time between full reprocessing campaigns), to use algorithms and tools that would be too CPU- or disk-intensive if run over all recorded events, and, in the future, to apply AI/ML methods that start from low-level information and could profit from rapid development/use cycles. This presentation describes the tools involved, the procedures followed and the current operational performance.
The Beijing Spectrometer (BESIII) detector is used for high-precision studies of hadron physics and tau-charm physics. Accurate and reliable particle identification (PID) is crucial to improve the signal-to-noise ratio, especially for K/π separation. The time-of-flight (TOF) system, which is based on plastic scintillators, is a powerful tool for particle identification at the BESIII experiment. The measured time is obtained using an empirical formula, which is used for time-walk and hit-position corrections, with Bhabha events used as calibration samples. The time difference is defined as the difference between the measured time and the expected time. Systematic time deviations of charged hadrons have been observed in the time differences for different particle species. This kind of systematic time deviation, which depends on the momentum and particle species, has been reported in several experiments using TOF systems based on plastic scintillation counters. Similar behaviors have also been observed in simulations, with different deviations. In this study, the dependence of time deviations on pulse heights and hit positions is systematically investigated using different species of hadron control samples. By applying corrections to the measured time, the time deviations are substantially reduced to nearly zero. The PID efficiencies of hadrons are enhanced both for real data and MC samples, and the systematic uncertainties of PID efficiencies are also optimized with further tuning. This study offers a new perspective on investigating time deviation in scintillation TOF detectors and provides a reference for improving detection accuracy.
The Level-1 Data Scouting (L1DS) is a novel data acquisition subsystem at the CMS Level-1 Trigger (L1T) that exposes the L1T primitives involved in event selection for online processing at the LHC's full 40 MHz bunch-crossing rate, enabling zero-bias and unconventional analyses. Since Run 3, an L1DS demonstrator has relied on a shared ramdisk for its incoming and intermediate data. While the HL-LHC and the CMS Phase-2 upgrade are projected to improve trigger resolutions, processing and storage concerns have prompted the development of a new L1DS processing pipeline with a performant shared memory lake. For this, we leverage the emerging Compute Express Link (CXL) standard, whose protocols provide uniform, cache-coherent memory access to heterogeneous processing units, e.g., smart NICs and GPUs. In this contribution, we present the integration of CXL-compliant shared memory into the L1DS pipeline at CMS, expounding on the observed benefits and limitations of our approach. Furthermore, we perform a comprehensive evaluation of the demonstrator system's performance in realistic analyses and discuss use cases for the CMS community.
Vector is a Python library for 2D, 3D, and Lorentz vectors, especially arrays of vectors, to solve common physics problems in a NumPy-like way. Vector currently supports creating pure Python objects, NumPy arrays, and Awkward Arrays of vectors. The Object and Awkward backends can also be used within Numba to leverage JIT-compiled vector calculations. Furthermore, vector also supports JAX and Dask operations on Awkward arrays of vectors.
We introduce a new SymPy backend in vector to allow symbolic computations on high energy physics vectors. Along with experimental physicists using vector for numerical computations, the SymPy backend will enable theoretical physicists to utilize the library for symbolic computations. Since the SymPy vector classes and their momentum equivalents operate on SymPy expressions, all of the standard SymPy methods and functions work on the vectors, vector coordinates, and the results of operations carried out on vectors. Moreover, vector’s SymPy backend will create a stronger connection between software used by experimentalists and software used by theorists.
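A short sketch of the numeric and symbolic usage patterns described above is given below; the vector.obj calls follow the library's documented object backend, while the SymPy class name and its momentum-style constructor are our assumption of the new backend's naming and may differ.

```python
import sympy
import vector

# Numerical object backend: two muon candidates and their invariant mass
mu1 = vector.obj(pt=35.0, eta=1.2, phi=0.4, mass=0.105)
mu2 = vector.obj(pt=28.0, eta=-0.7, phi=2.9, mass=0.105)
print((mu1 + mu2).mass)

# Symbolic backend (class/constructor names assumed): coordinates are SymPy
# symbols, so the same operations return symbolic expressions instead of numbers.
px, py, pz, E = sympy.symbols("px py pz E", positive=True)
p = vector.MomentumSympy4D(px=px, py=py, pz=pz, E=E)
print(p.mass)  # a SymPy expression in px, py, pz, E
```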
This talk will introduce vector and its backends to the users and funnel down to the SymPy backend. Finally, vector’s SymPy backend is relatively new; hence, we aim to collect suggestions and recommendations from both theoretical and experimental physicists.
New strategies for the provisioning of compute resources, e.g. in the form of dynamically integrated resources enabled by the COBalD/TARDIS software toolkit, require a new approach to collecting accounting data. AUDITOR (AccoUnting DatahandlIng Toolbox for Opportunistic Resources), a flexible and expandable accounting ecosystem that can cover a wide range of use cases and infrastructures, was developed specifically for this purpose. Accounting data is collected via so-called collectors and stored in a database. So-called plugins can access the data and act based on the accounting information. Access to the data is handled by the core component of AUDITOR, which provides a REST API together with a Rust and a Python client library.
An HTCondor collector, a Slurm collector and a TARDIS collector are currently available, and a Kubernetes collector is already in the works.
The APEL plugin enables, for example, the creation of APEL accounting summaries and their transmission to the APEL accounting server. Although the original aim of the development of AUDITOR was to enable the accounting of opportunistic resources managed by COBalD/TARDIS, it can also be used for standard accounting of a WLCG computing resource. As AUDITOR uses a highly flexible data structure to store accounting data, extensions such as GPU resource accounting can be added with minimal effort.
This contribution provides insights into the design of AUDITOR and shows how it can be used to enable a number of different use cases.
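As an illustration of the record flow described above, the sketch below pushes a single accounting record to the AUDITOR core over HTTP; the endpoint path and record fields are hypothetical, and in practice the provided Rust or Python client libraries would wrap this interaction.

```python
# Illustrative only: a collector-style push of one accounting record to the
# AUDITOR core via its REST API. Endpoint path and field names are hypothetical.
import requests

record = {
    "record_id": "htcondor-job-1234567",      # hypothetical job identifier
    "site_id": "EXAMPLE-SITE",                # hypothetical site name
    "start_time": "2024-06-01T10:00:00Z",
    "stop_time": "2024-06-01T12:30:00Z",
    "components": [                           # flexible structure: one entry
        {"name": "cores", "amount": 8},       # per accounted resource, e.g.
        {"name": "memory_gb", "amount": 16},  # CPU cores, memory, GPUs, ...
    ],
}

resp = requests.post("http://auditor.example:8000/record", json=record, timeout=10)
resp.raise_for_status()
```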
The aim of this paper is to give an overview of the progress made in the EOS project - the large-scale data storage system developed at CERN - during the preparation for and the operation of LHC Run 3. Developments include further simplification of the service architecture, metadata performance improvements, new memory inventory and cost & value interfaces, a new scheduler implementation, a generated REST API derived from the GRPC protocol, and new or better integration of features such as SciTags and SciTokens. We will report on operational experiences, the massive migration process to ALMA9, improvements in the quality assurance process and the results achieved. Looking to the future, we will describe the development and evolution of EOS for Run 4 and highlight various software R&D and technology evaluation activities (e.g. SMR support) that have the potential to help realize the Run-4 requirements for physics storage at CERN and elsewhere.
The CMS Experiment at the CERN Large Hadron Collider (LHC) relies on a Level-1 Trigger system (L1T) to process in real time all potential collisions, happening at a rate of 40 MHz, and select the most promising ones for data acquisition and further processing. The CMS upgrades for the upcoming high-luminosity LHC run will vastly improve the quality of the L1T event reconstruction, providing opportunities for a complementary Data Scouting approach in which physics analysis is performed on a data stream containing all collisions but limited to L1T reconstruction. This poster describes the future Data Scouting system, first estimates of its physics capabilities, and the demonstration setups used to assess its technical feasibility.
The CMS experiment has recently established a new Common Analysis Tools (CAT) group. The CAT group implements a forum for the discussion, dissemination, organization and development of analysis tools, broadly bridging the gap between the CMS data and simulation datasets and the publication-grade plots and results. In this talk we discuss some of the recent developments carried out in the group, including the structure of the group, the facilities and services provided, the communication channels, the ongoing developments in the context of frameworks for data processing, strategies for the management of analysis workflows and their preservation, and tools for the statistical interpretation of analysis results.
The recently approved SHiP experiment aims to search for new physics at the intensity frontier, including feebly interacting particles and light dark matter, and perform precision measurements of tau neutrinos.
The SHiP software framework is crucial to realising the experiment's full discovery potential, and it faces some unique challenges due to the broad range of models under study and the extreme statistics necessary for the background studies. The SHiP environment also offers unique opportunities for machine learning in detector design and anomaly detection.
This talk will give an overview of the general software framework and of past, ongoing and planned simulation and machine learning studies.
Data analysis in the field of High Energy Physics presents typical big data requirements, such as the vast amount of data to be processed efficiently and quickly. The Large Hadron Collider in its high-luminosity phase will produce about 100 PB/year of data, ushering in the era of high-precision physics. Currently, analysts build and share their software on git-based platforms, which improve reproducibility and offer a high level of workflow automation. On the other hand, it is becoming more and more important to complement this with easy and user-friendly access to distributed resources for CPU-intensive calculations. In this talk, we will show how it is possible to enable Continuous Integration (CI) with CMS datasets by using the XRootD IO protocol and dynamic proxy generation and, in combination with the GitLab CI/CD functionalities, how to trigger an analysis execution with a simple commit. By using dynamically generated authentication tokens it is possible to offload all the CPU-heavy work from the GitLab workers to on-demand computing resources: from regional CMS Tier-2 resources to the nation-wide datalake model currently under deployment within the ICSC (the Italian national centre for research in HPC, big data and quantum computing) project. Thanks to this approach, integrating the submission of jobs to HTCondor into the GitLab CI becomes easier, automating the handling of big datasets. In this way analysts will be able to quickly run different tests on their data, perform different analyses in parallel and, at the same time, keep track of all the changes made.
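As an example of the kind of step such a CI pipeline can run once a short-lived credential is in place, the sketch below streams part of a CMS NanoAOD file over the XRootD protocol with uproot; the file path is a placeholder, and the XRootD Python bindings (or fsspec-xrootd) are assumed to be available in the CI image.

```python
# Illustrative CI step: read a small slice of a NanoAOD file over XRootD.
import uproot

# Placeholder path: a real job would point at the dataset under test
url = "root://xrootd-cms.infn.it//store/path/to/nanoaod.root"

with uproot.open(url) as f:
    events = f["Events"]
    muon_pt = events["Muon_pt"].array(entry_stop=10_000)  # first 10k events only
    print("entries read:", len(muon_pt))
```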
The BESIII experiment operates as an electron-positron collider in the tau-charm energy region, pursuing a range of physics goals related to charm, charmonium, light hadron decays, and so on. Among these objectives, achieving accurate particle identification (PID) plays a crucial role, ensuring both high efficiency and low systematic uncertainty. In the BESIII experiment, PID performance heavily relies on two key measurements: the energy deposit per unit length (dE/dx) obtained from the main drift chamber (MDC) sub-detector, and the time of flight (TOF) measurement from the TOF sub-detector.
This contribution focuses specifically on the dE/dx aspect and provides a comprehensive overview of the dE/dx software employed in the BESIII experiment. The presentation covers the simulation, calibration, and reconstruction techniques implemented in the analysis pipeline. Last but not least, a study of machine-learning (ML)-based dE/dx simulation will also be presented.
A modern version control system is capable of performing Continuous Integration (CI) and Continuous Deployment (CD) in a safe and reliable manner. Many experiments and software projects in High Energy Physics are now developed using such modern development tools, for example GitHub. However, refactoring a large-scale running system can be challenging and difficult to execute. This is why the BES Offline Software System (BOSS) continues to be developed using an outdated version control system, namely the Concurrent Versions System (CVS). CVS does not automatically check the committed code during the commit process. To address this issue, a new auto-validation system has been developed, which overrides parts of the 'cvs' subcommand, enabling automatic code checks immediately after committing. Moreover, through integration with GitLab, it includes functions designed for the convenience of developers and system managers, allowing them to work on multiple tasks simultaneously while validated code is collected automatically. This approach strikes a balance between stability and innovation, allowing developers and system managers to enjoy the benefits of a modern version control system without having to alter their work habits too much. The system is currently in use for the development and maintenance of BOSS.
Users may have difficulty finding the information they need when documentation for a product is spread over many web pages or email forums. We have developed and tested an AI-based tool that helps users find answers to their questions. The Docu-bot uses a Retrieval-Augmented Generation (RAG) approach to generate answers to various questions. It uses GitHub or open GitLab repositories with documentation as a source of information; zip files with documentation in plain text or Markdown format can also be used as input. A sentence-transformer model retrieves the relevant passages and a Large Language Model generates the answers.
Different LLMs can be used. For performance reasons, in most tests we use the model Mistral-7B-Instruct-v0.2, which fits into the memory of an Nvidia T4 GPU. We have also tested a larger model, Mixtral-8x7B-Instruct-v0.1, which requires more GPU memory, available for example on Nvidia A100, A40 or H100 GPU cards. Another possibility is to use the API of OpenAI models such as gpt-3.5-turbo, but then users have to provide their own API access key to cover the expenses.
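The retrieval step can be condensed into a few lines; the sketch below assumes documentation chunks already extracted from the repositories and a sentence-transformer encoder, and leaves the generation step (Mistral/Mixtral or an OpenAI model) as a final comment since it is configuration-dependent.

```python
# Minimal retrieval sketch for a RAG pipeline over documentation chunks.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = ["chunk of markdown documentation ...", "another chunk ..."]  # from git repos
doc_emb = encoder.encode(docs, normalize_embeddings=True)

question = "How do I renew my grid certificate?"
q_emb = encoder.encode([question], normalize_embeddings=True)[0]

# Cosine similarity (embeddings are normalized); keep the best-matching chunks
scores = doc_emb @ q_emb
context = "\n\n".join(docs[i] for i in np.argsort(scores)[::-1][:3])

prompt = f"Answer using only this documentation:\n{context}\n\nQuestion: {question}"
# The prompt is then passed to the configured LLM (e.g. Mistral-7B-Instruct)
# to generate the final answer.
```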
EDM4hep aims to establish a standard event data model for the storage and exchange of event data in HEP experiments, thereby fostering collaboration across various experiments and analysis frameworks. The Julia package EDM4hep.jl is capable of generating Julia-friendly structures for the EDM4hep data model and reading event data files in ROOT format (either TTree or RNTuple) that are written by C++ programs, utilising the UnROOT.jl package. This paper explores the motivations behind the primary design choices of this package, such as the exclusive use of structures of arrays (SoA) to access the stored collections, which empower users to develop ergonomic data analyses using Julia's high-level concepts and functionality while maintaining performance comparable to C++ programs. Several examples are given to illustrate how efficient data analysis can be achieved using high-level objects, eliminating the need to resort to flat n-tuples.
2024 marks not just CERN’s 70th birthday but also the end of analogue telephony at the laboratory. Traditional phone exchanges and the associated copper cabling cannot deliver 21st-century communication services and a decade-long project to modernize CERN’s telephony infrastructure was completed earlier this year.
We report here on CERN’s modern fixed telephony infrastructure, firstly our in-house development of an exchange which, based on open-source components and standard VoIP protocols, supports softphones, call centers, safety communications, interconnections with other voice services and an automatic switchboard, and secondly the two CERNphone applications that have replaced fixed phones, and which are used by more than 6000 users each week.
The dCache storage management system at Brookhaven National Lab plays a vital role as a disk cache, storing extensive datasets from high-energy physics experiments, mainly the ATLAS experiment. Given that dCache’s storage is significantly smaller than the total ATLAS data, it’s crucial to have an efficient cache management policy. A common approach is to keep files that are accessed often, ready for future use. In our research, we analyze both recent and past patterns of file usage to predict the chances of them being needed again. Although dCache considers each file separately, we’ve observed that files within a dataset tend to be used together. Therefore, the system manager often gets requests to retain entire datasets in the cache, especially if they’re expected to be in high demand soon. Our main focus is to determine if we could accurately forecast a dataset’s future demand to automate the process of deciding which datasets to prioritize in the cache.
Our approach's cornerstone is a dynamic learning mechanism that regularly analyzes recent access logs. This process updates our machine learning models, enabling them to forecast the popularity of various datasets in the near future. Specifically, our predictive model estimates the expected number of accesses for each dataset in the upcoming days. We then match these predictions against the cache space reserved for sought-after datasets, which allows us to proactively load the most in-demand datasets into the disk cache. This strategic reservation method operates in conjunction with the current file removal policy, collectively improving the overall efficiency of the system.
To develop a predictive model for our caching system, we assessed several techniques and metrics to distinguish popular datasets from less popular ones effectively. Employing k-means clustering, we categorized datasets based on their popularity and explored diverse methods to precisely measure dataset usage. Given our constrained disk space, our aim was to optimize the selection of retained datasets, thereby improving cache efficiency.
A prior study [1] demonstrated the feasibility of detecting popular datasets using a machine learning approach. In this study, we compare the predictive efficacy of two distinct models: a neural network model and a gradient-boosted trees regression model (XGBoost). The models, configured with 17 input variables, are trained on 127 million data points collected over a span of three years from our data processing pipeline. Additionally, both models underwent hyperparameter tuning via Optuna, conducted on Perlmutter at NERSC.
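For illustration, a minimal sketch of the gradient-boosted regression setup (not the production configuration) is shown below: a table of 17 engineered features per dataset, the next-day access count as the regression target, and RMSE as the figure of merit.

```python
# Illustrative sketch of the XGBoost regression described above; the feature
# table and targets are random placeholders, not the real access logs.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X = np.random.rand(10_000, 17)            # placeholder for 17 dataset features
y = np.random.poisson(3.0, size=10_000)   # placeholder next-day access counts

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBRegressor(n_estimators=500, max_depth=8, learning_rate=0.05)
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print("test RMSE:", rmse, " relative to std:", rmse / y_test.std())
```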
Fig. 1. December 2023 comparison of predicted and actual dataset accesses using XGBoost. Axes represent next-day actual (x) vs. predicted (y) accesses. Points are colored based on the recency of last access, with lighter points indicating predictions are made with older data records. The red diagonal line indicates perfect predictions. A high correlation coefficient (0.84) reflects strong prediction accuracy, especially at higher access counts.
Despite the inherent difficulty in forecasting future dataset accesses, our models showed promising performance. Notably, the XGBoost model displayed a lower root mean squared error (RMSE) for testing datasets compared to the neural network. Specifically, the relative ratios of testing RMSE to standard deviation were 0.28 for XGBoost and 0.84 for the neural network models.
Our research confirms that predicting dataset popularity is feasible through careful analysis of data features and the application of well-designed models. While the real-world application of these models in live caching policies requires further testing, our study underscores the potential of machine learning in improving dCache systems. Future endeavors will concentrate on implementing, benchmarking, and validating the efficacy of these proposed methods.
REFERENCES
[1] J. Bellavita, C. Sim, K. Wu, A. Sim, S. Yoo, H. Ito, V. Garonne, and E. Lancon, "Understanding data access patterns for dcache system," in 26th International Conference on Computing in High Energy & Nuclear Physics (CHEP2023), 2023.
We describe how to effectively and efficiently stage a large number of requests from an IBM HPSS environment, using a MariaDB database to keep track of requests and Python for all business logic and for consuming the HPSS API. The goal is to scale to a large number of requests, to meet the different needs of different experiments, and to make the program adaptable enough to allow each experiment to have its own unique business logic. This update will take advantage of features of the newest versions of HPSS, as well as MariaDB, Python, and Linux. Furthermore, the hope is that the application will be able to log and handle a wider array of errors and exceptions, and allow for more in-depth monitoring, as the status of each request is stored in a database that allows for easy querying. This may also enable additional enhancements such as staging requests by priority.
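A minimal sketch of the request-tracking pattern is shown below; the table and column names are hypothetical, and the actual staging call would go through the HPSS API rather than the placeholder function used here.

```python
# Illustrative only: track stage requests in MariaDB and drive staging from Python.
import pymysql


def stage_from_hpss(path):
    """Placeholder for the real HPSS API interaction."""
    return True


conn = pymysql.connect(host="db.example", user="stager",
                       password="***", database="staging")
cur = conn.cursor()

# Pick the next batch of queued requests, highest priority first
cur.execute("SELECT id, path FROM requests "
            "WHERE status = 'queued' ORDER BY priority DESC LIMIT 100")
for req_id, path in cur.fetchall():
    ok = stage_from_hpss(path)
    cur.execute("UPDATE requests SET status = %s WHERE id = %s",
                ("staged" if ok else "failed", req_id))

conn.commit()
conn.close()
```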
The interTwin project, funded by the European Commission, is at the forefront of leveraging 'Digital Twins' across various scientific domains, with a particular emphasis on physics and earth observation. One of the most advanced use cases of interTwin is event generation for particle detector simulation at CERN. interTwin enables particle detector simulations to leverage AI methodologies on cloud and high-performance computing (HPC) resources by using itwinai - the AI workflow and method lifecycle module of interTwin.
The itwinai module, a comprehensive solution for the AI workflow and method lifecycle developed collaboratively by CERN and the Jülich Supercomputing Centre (JSC), serves as the cornerstone for researchers, data scientists, and software engineers engaged in developing, training, and maintaining AI-based methods for scientific applications, such as particle event generation. Its role is to advance interdisciplinary scientific research through the synthesis of learning and computing paradigms. This framework stands as a testament to the commitment of the interTwin project towards co-designing and implementing an interdisciplinary Digital Twin Engine. Its main functionalities and contributions are:
Distributed Training: itwinai offers a streamlined approach to distributing existing code across multiple GPUs and nodes, automating the training workflow. Leveraging industry-standard backends, including PyTorch Distributed Data Parallel (DDP), TensorFlow distributed strategies, and Horovod, it provides researchers with a robust foundation for efficient and scalable distributed training (a plain-PyTorch sketch of the boilerplate this automates is shown after this list). The successful deployment and testing of itwinai on JSC's HDFML cluster underscore its practical applicability in real-world scenarios.
Hyperparameter Optimization: One of the core functionalities of itwinai is its hyperparameter optimization, which plays a crucial role in enhancing model accuracy. By intelligently exploring hyperparameter spaces, itwinai eliminates the need for manual parameter tuning. The functionality, empowered by RayTune, contributes significantly to the development of more robust and accurate scientific models.
Model Registry: A key aspect of itwinai is its provision of a robust model registry. This feature allows researchers to log and store models along with associated performance metrics, thereby enabling comprehensive analyses in a convenient manner. The backend, leveraging MLFlow, ensures seamless model management, enhancing collaboration and reproducibility.
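As referenced above, the following is a plain-PyTorch sketch of the distributed-training boilerplate that itwinai automates for its users (launched e.g. with torchrun); it does not reproduce the itwinai interface itself, and the model and training loop are stand-ins.

```python
# Plain PyTorch DDP boilerplate; launch with:
#   torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(64, 1).to(local_rank)    # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])      # gradients synced across ranks
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for _ in range(100):                             # stand-in training loop
        x = torch.randn(256, 64, device=local_rank)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```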
In line with the “Computing infrastructure” track of CHEP 2024, interTwin and its use-cases empowered by itwinai are positioned at the convergence of computation and physics and showcase the significant potential of AI research supported by HPC resources. Together, they contribute to a narrative of interconnected scientific frontiers, where the integration of digital twins, AI frameworks, and physics research broadens possibilities for exploration and discovery through itwinai’s user-friendly interface and powerful functionalities.
In conclusion, itwinai is a valuable and versatile resource, empowering researchers and scientists to embark on collaborative and innovative scientific research endeavors across diverse domains, building on the integration of physics-based digital twins and AI frameworks described above.
Machine Learning (ML)-based algorithms play increasingly important roles in almost all aspects of data processing in the ATLAS experiment at CERN. Diverse ML models are used in detector simulation, event reconstruction, and data analysis. They are being deployed in the ATLAS software framework, Athena. Our primary approach to perform ML inference in Athena is to use ONNXRuntime. ONNXRuntime is a cross-platform ML model acceleration library, with a flexible interface to integrate hardware-specific libraries. In this talk, we will describe the ONNXRuntime interface in Athena and the impact of advanced ONNXRuntime settings on various ML models and workflows at ATLAS.
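The Athena interface itself is implemented in C++, but the underlying ONNXRuntime usage pattern it wraps looks like the following Python sketch, with the model path and input shape as placeholders.

```python
# Generic ONNXRuntime inference pattern: load the model once, then feed
# batches of inputs during event processing. Model path and shape are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("tagger.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

batch = np.random.rand(1, 20).astype(np.float32)   # placeholder input features
scores = session.run(None, {input_name: batch})[0]
print(scores)
```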
CMS Analysis Database Interface (CADI) is a management tool for physics publications in the CMS experiment. It acts as a central database for the CMS collaboration, keeping track of the various analysis projects being conducted by researchers. Each analysis paper goes through an extensive journey from early analysis to publication, involving various stakeholders who provide comments and feedback and may take part in the approval or disapproval of the analysis. Front End Engine for Glance (FENCE) is a technology developed by the UFRJ team to unify and facilitate the development of UFRJ-CERN collaboration systems. It allows system interfaces to be created by simply editing a JSON configuration file, without requiring deep programming knowledge from users or changes to the system's internal source code. The current ATLAS system, built on the Glance technology with FENCE as an abstraction layer above it, allows users to access the heterogeneous data sources related to the experiments in a simple and efficient way. Originally developed for ATLAS, FENCE was recently redesigned by LHCb following a more modular architecture - splitting the code base into a PHP-based REST API backend and a VueJS-based frontend service - and this version was also adopted by the LHCb and ALICE experiments. CMS decided to migrate CADI to the new version of the FENCE system. For CMS, two subsystems of FENCE are initially considered: "membership" and "analysis life cycle management" (ALCM). The membership subsystem is a prerequisite for ALCM and contains information on members, institutes, authorships, and various reports, whereas the ALCM subsystem is primarily used for the management of publication workflows like CADI. In this talk, we describe the procedure we followed to migrate CADI to FENCE, report the issues we encountered, and share the lessons learned so that other experiments migrating to FENCE in the future will not have to face the same problems.
Graph neural networks (GNNs) have emerged as a cornerstone of ML-based reconstruction and analysis algorithms in particle physics. Many of the proposed algorithms are intended to be deployed close to the beginning of the data processing chain, e.g. in the event reconstruction software of running and future collider-based experiments. For GNNs to operate, the input data are represented as graphs. The creation of these graphs and the associated cost are often limiting factors in high-throughput production environments. We discuss the specific example of charged-particle track reconstruction in the ATLAS detector. The HL-LHC upgrade of the ATLAS detector brings an unprecedented track reconstruction challenge, both in terms of the large number of silicon hit cluster readouts and the required throughput. The GNN4ITk project has designed GNN-based algorithms for tracking with a level of physics performance similar to traditional techniques that scale sub-quadratically, provided that the large input graphs can be created efficiently. In this contribution, we present novel methods that are able to produce these graphs quickly and efficiently, and describe their computing performance.
Monte Carlo (MC) simulations are a crucial component when analysing the Standard Model and New Physics processes at the Large Hadron Collider. The goal of this work is to explore the performance of generative models for complementing the statistics of classical MC simulations in the final stage of data analysis, by generating additional synthetic data that follows the same kinematic distributions for a limited set of analysis-specific observables to a high precision. A normalizing-flow architecture was adapted for this task and its performance was systematically evaluated using a well-known benchmark sample containing Higgs boson production beyond the Standard Model and the corresponding irreducible background. The applicability of normalizing flows under different model parameters and a restricted number of initial events used in training was investigated. The best-performing model was then chosen for further evaluation with a set of statistical procedures and a simplified physics analysis. We demonstrate that the number of events used in training, coupled with the flow architecture, is crucial for the physics performance of the generative model. By implementing and performing a series of statistical tests and evaluations we show that a machine-learning-based generative procedure can be used to generate synthetic data that matches the original samples closely enough, and that it can therefore be incorporated in the final stage of a physics analysis with a given systematic uncertainty.
In response to increasing data challenges, CMS has adopted the use of GPU offloading at the High-Level Trigger (HLT). However, GPU acceleration is often hardware specific, and increases the maintenance burden on software development. The Alpaka (Abstraction Library for Parallel Kernel Acceleration) portability library offers a solution to this issue, and has been implemented into the CMS software (CMSSW) for use online at HLT.
A portion of the final-state particle candidate reconstruction algorithm, Particle Flow, has been ported to Alpaka and deployed at HLT for 2024 data taking. The formation of hadronic Particle Flow clusters represented a target for increased performance through parallel operation. We will discuss the port of hadronic Particle Flow clustering to Alpaka, and the validation of physics and performance at HLT.
With the upcoming upgrade of the High Luminosity LHC, the need for computation power will increase in the ATLAS trigger system by more than an order of magnitude. Therefore, new particle track reconstruction techniques are explored by the ATLAS collaboration, including the usage of Graph Neural Networks (GNN). The project focusing on that research, GNN4ITk, considers several heterogeneous computing options, including the usage of Graphics Processing Units (GPU). The framework can reconstruct tracks with high efficiency; however, the computing requirements of the pipeline are high. We will report on the efforts to reduce the memory consumption and inference time enough to enable the usage of commercially available and affordable GPUs for the future ATLAS trigger system while maintaining high tracking performance.
The escalating demand for data processing in particle physics research has spurred the exploration of novel technologies to enhance the efficiency and speed of calculations. This study presents the development of a port of MADGRAPH, a widely used tool for particle collision simulations, to FPGAs using High-Level Synthesis (HLS).
Experimental evaluation is ongoing, but preliminary assessments suggest a promising enhancement in calculation speed compared to traditional CPU implementations. This potential improvement could enable the execution of more complex simulations within shorter time frames.
This study describes the complex process of adapting MADGRAPH to FPGAs using HLS, focusing on optimizing algorithms for parallel processing. A key aspect of the FPGA implementation of the MADGRAPH software is the reduction of power consumption, which has important implications for the scalability of computing centres and for the environment. These advancements could enable faster execution of complex simulations, highlighting the crucial role of FPGAs in advancing particle physics research while reducing its environmental impact.
Deep sets network architectures have useful applications in finding correlations in unordered and variable-length data input, having the interesting feature of being permutation invariant. Their use on FPGAs would open up accelerated machine learning in areas where the input has no fixed length or order, such as inner detector hits for clustering or associated particle tracks for jet tagging. We adapted DIPS (Deep Impact Parameter Sets), a deep sets neural network flavour tagging algorithm previously used in ATLAS offline low-level flavour tagging and online b-jet trigger preselections, for use on FPGA with the aim to assess its performance and resource costs. QKeras and HLS4ML are used for quantisation-aware training and translation for FPGA implementation, respectively. Some challenges are addressed, such as finding replacements for functionality not available in HLS4ML (e.g. Time Distributed layers) and implementations of custom HLS4ML layers. Satisfactory implementations are tested on an actual FPGA board for the assessment of true resource consumption and latency. We show the optimal FPGA-based algorithm performance relative to the CPU-based full-precision performance previously achieved in the ATLAS trigger, as well as performance trade-offs when reducing FPGA resource usage as much as possible. The project aims to demonstrate a viable solution for performing sophisticated Machine Learning-based tasks for accelerated reconstruction or particle identification for early event rejection, while running in parallel to other more intensive tasks on FPGA.
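To illustrate the QKeras-plus-HLS4ML workflow mentioned above (not the DIPS architecture itself, which additionally needs per-track shared dense layers and a sum pooling), the sketch below performs quantisation-aware training setup for a small dense model and converts it to an HLS project; layer sizes, bit widths and the FPGA part are placeholders.

```python
# Quantisation-aware model definition and hls4ml conversion, illustrative only.
import hls4ml
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

# A small quantised MLP as a stand-in for the real tagger
inputs = Input(shape=(40,))
x = QDense(32, kernel_quantizer=quantized_bits(8, 0, alpha=1),
           bias_quantizer=quantized_bits(8, 0, alpha=1))(inputs)
x = QActivation(quantized_relu(8))(x)
outputs = QDense(3, kernel_quantizer=quantized_bits(8, 0, alpha=1),
                 bias_quantizer=quantized_bits(8, 0, alpha=1))(x)
model = Model(inputs, outputs)
# ... quantisation-aware training with model.fit(...) would go here ...

# Convert the trained model into an HLS project for a chosen FPGA part
config = hls4ml.utils.config_from_keras_model(model, granularity="name")
hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, output_dir="dips_like_hls",
    part="xcvu9p-flga2104-2-e")
hls_model.compile()  # builds a C simulation library for bit-accurate checks
```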
Simulation of the detector response is a major computational challenge in modern High-Energy Physics experiments, accounting for about 40% of the total computational resources used in ATLAS. The simulation of the calorimeter response is particularly demanding, consuming about 80% of the total simulation time.
In order to make the best use of the available computational resources, fast simulation tools based on Machine Learning techniques have been developed to simulate the calorimeter response faster than Geant4 while maintaining a high level of accuracy. One such tool, developed by the ATLAS Collaboration and currently in production for LHC Run 3, is FastCaloGAN, which uses Generative Adversarial Networks (GANs) to generate electromagnetic and hadronic showers.
To facilitate the training and optimisation of the GANs, and to enable a more efficient use of computational resources, a container-based system, FastCaloGANtainer, has been developed; it deploys the FastCaloGAN training on complementary high-performance resources such as High Performance Computing (HPC) farms and ensures its operational independence from the underlying system.
This talk presents the latest developments in FastCaloGAN and FastCaloGANtainer, discussing their technical details and recent improvements in terms of Physics and computational performance. For FastCaloGAN, these improvements include an improved voxelisation and extension to further use cases (e.g. particle types not yet covered), while for FastCaloGANtainer they concern its deployment on a wider variety of resources with multi-CPU/GPU nodes and different architectures (including cutting-edge HPC clusters such as Leonardo at CINECA in Bologna, Italy).
GitLab Runners have been deployed at CERN since 2015. A GitLab runner is an application that works with GitLab Continuous Integration and Continuous Delivery (CI/CD) to run jobs in a pipeline. CERN provides runners that are available to the whole GitLab instance and can be used by all eligible users. Until 2023, CERN provided a fixed number of Docker runners executing in OpenStack virtual machines, following an in-house, customized solution based on the Docker+machine executor. This solution served its purpose for several years; however, it needed to be reviewed after its deprecation by Docker Inc., which left only a fork maintained by GitLab Inc.
During the last few years, the demand and the number of running pipelines have substantially increased, as the adoption of Continuous Integration and Delivery has been rapidly growing.
In view of the above, CERN needed to provide a supported, scalable infrastructure that would accommodate our users' demand.
This paper describes how CERN migrated from the legacy in-house solution to a new scalable, reliable and easy-to-maintain solution of runners based on Kubernetes, including the challenges faced and lessons learned during this complex migration process.
Amazon S3 is a leading object storage service known for its scalability, data reliability, security and performance. It is used as a storage solution for data lakes, websites, mobile applications, backup, archiving and more. With its management features, users can optimise data access to meet specific requirements and compliance standards. Given its popularity, many tools utilise the S3 interfaces. To enhance CERN’s EOS Big Data storage, we are integrating an S3 interface into XRootD that is customised for EOS. This article describes the design, progress and future plans for the integration of the S3 API.
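Once the S3 interface is in place, any standard S3 client pointed at the EOS endpoint should work; the sketch below uses boto3 with a placeholder endpoint, bucket and credentials.

```python
# Illustrative S3 client usage against a custom (non-AWS) endpoint.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.eos.example.cern.ch",  # placeholder endpoint
    aws_access_key_id="KEY",                        # placeholder credentials
    aws_secret_access_key="SECRET",
)

# List objects in a bucket and download one of them
for obj in s3.list_objects_v2(Bucket="analysis-data").get("Contents", []):
    print(obj["Key"], obj["Size"])

s3.download_file("analysis-data", "histos/run2024.root", "/tmp/run2024.root")
```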
Since 2016, CERN has been using the OpenShift Kubernetes Distribution to host a platform-as-a-service (PaaS). This service is optimized for hosting web applications and has grown to tens of thousands of individual websites. By now, we have established a reliable framework that deals with varied use cases: thousands of websites per ingress controller (8K+ hostnames), long-lived connections (30K+ concurrent sessions) and high-traffic applications (25TB+ per day).
This session will discuss:
Reinforcement Learning is emerging as a viable technology to implement autonomous beam dynamics setup and optimization in particle accelerators. A Deep Learning agent can be trained to efficiently explore the parameter space of an accelerator control system and converge to the optimal beam setup much faster than traditional methods. Training these models requires programmatic execution of a high volume of simulations. This contribution introduces pytracewin, a Python wrapper of the TraceWin beam dynamics simulator, which exposes simple methods to run simulations and retrieve results. It can be easily combined with the large Python ecosystem of Machine Learning and Reinforcement Learning libraries to develop optimization models. Still, the training process is computationally constrained by the number of simulations that can be run in a reasonable time. It is thus crucial to scale such workload on a dedicated computing infrastructure while retaining a simple high-level user interface.
We exploit Ray, an open-source library, to enable embarrassingly parallel execution of TraceWin simulations on Kubernetes, using a dynamically scalable number of workers and requiring minimal user code modifications. Workers are instantiated with a custom docker image combining Ray and pytracewin. The approach is validated using two Kubernetes clusters on INFN Cloud and CloudVeneto to simulate the ADIGE beam line at Legnaro National Laboratories.
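The parallelisation pattern reduces to a few lines of Ray; in the sketch below the Ray calls follow the library's public API, while the pytracewin method and file names are assumptions for illustration and may differ from the actual wrapper.

```python
# Embarrassingly parallel TraceWin runs on a Ray cluster (illustrative sketch).
import ray

ray.init(address="auto")   # attach to the Ray cluster running on Kubernetes


@ray.remote
def run_point(settings):
    # The pytracewin names below are assumed for illustration
    from pytracewin import TraceWin          # assumed wrapper entry point
    tw = TraceWin(project="adige_line.ini")  # hypothetical project file
    tw.run(**settings)                       # assumed method name
    return tw.results()                      # assumed method name


# Toy parameter scan: one remote task per simulation point
scan = [{"quad_gradient": g} for g in (1.0, 1.1, 1.2, 1.3)]
results = ray.get([run_point.remote(s) for s in scan])
print(len(results), "simulations completed")
```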
In ATLAS and other high-energy physics experiments, the integrity of Monte Carlo (MC) simulations is crucial for reliable physics analysis. The continuous evolution of MC generators necessitates regular validation to ensure the accuracy of simulations. We introduce an enhanced validation framework incorporating the Job Execution Monitor (JEM), resulting in the established Physics Modeling Group (PMG) Architecture for Validating Evgen with Rivet (PAVER). This setup automates the validation process, facilitating systematic evaluation of MC generator updates and their compliance with experimental data.
This approach allows for early detection of discrepancies in simulation outputs, ensuring that potential issues and bugs are addressed before the production of large-scale samples for the ATLAS collaboration. MC generator validation is especially important for saving energy and money and for significantly reducing the carbon footprint of future simulation campaigns, which aligns well with the importance of reaching sustainability within ATLAS. The result is a streamlined, robust, and accessible validation system that supports sustainable MC production in ATLAS.
This presentation will summarize the implementation of PAVER, highlighting its impact on enhancing simulation reliability and efficiency. It will also include an overview of the extensive validation programme of the past years, which has resulted in many successfully validated generator and software updates. In addition, this talk will present insights into the challenges and solutions in MC generator validation, with implications for future developments in high-energy physics simulations.
The Jiangmen Underground Neutrino Observatory (JUNO), located in Southern China, is a multi-purpose neutrino experiment that consists of a central detector, a water Cherenkov detector and a top tracker. The primary goal of the experiment is to determine the neutrino mass ordering (NMO) and precisely measure neutrino oscillation parameters. The central detector contains 20,000 tons of liquid scintillator and is instrumented with 17,612 20-inch PMTs and 25,600 3-inch PMTs for anti-neutrino detection with an energy resolution of 3% at 1 MeV. The electronics simulation is a crucial module of the JUNO offline software (JUNOSW). It takes the photoelectron information from the Geant4-based detector simulation as input to simulate the PMT response, trigger logic and electronics response of the sub-detectors, using an implementation based on SNiPER-managed dynamically-loadable elements (DLE). The electronics simulation incorporates a "hit-level" event mixing implementation which combines different event types at different rates to mimic the data stream of real experimental data. The event mixing uses a "pull"-based workflow built on the SNiPER incident schema. The electronics simulation outputs become inputs to the online event classification (OEC) algorithms used for event tagging and are then saved to file using the ROOT I/O services. In this talk, a detailed introduction to the electronics simulation software will be presented.
The ATLAS experiment at the LHC at CERN uses a large, distributed trigger and
data acquisition system composed of many computing nodes, networks, and
hardware modules. Its configuration service is used to provide descriptions of
control, monitoring, diagnostic, recovery, dataflow and data quality
configurations, connectivity, and parameters for modules, chips, and channels
of various online systems, detectors, and the whole ATLAS experiment. Those
descriptions have historically been stored in more than one thousand
interconnected XML files, which are updated by various experts many times per
day. Maintaining error-free and consistent sets of such files and providing
reliable and fast access to current and historical configurations is a major
challenge. This paper gives details of the configuration service upgrade on the
modern git version control system backend for LHC Run 3 and its exploitation
experience. It may be interesting for developers using human-readable file
formats, where consistency of the files, performance, access control,
traceability of modifications, and effective archiving are key requirements.
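As a minimal illustration of what a git-backed configuration archive makes possible (a sketch only: the repository path, tag and file name below are hypothetical, and the real service uses its own access layer rather than shelling out to git), any historical configuration can be read back with `git show <ref>:<path>`:

```python
import subprocess

def read_config_at(repo_path: str, ref: str, file_path: str) -> str:
    """Return one configuration file as it was at a given commit, tag or
    branch, using `git show <ref>:<path>`."""
    return subprocess.run(
        ["git", "-C", repo_path, "show", f"{ref}:{file_path}"],
        check=True, capture_output=True, text=True,
    ).stdout

# Hypothetical example: a partition description as archived for a given tag.
xml_text = read_config_at("/path/to/config-repo", "run3-2023-10-01", "partitions/ATLAS.xml")
```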
The LHCb detector, a multi-purpose detector with a main focus on the study of hadrons containing b- and c-quarks, has been upgraded to enable precision measurements at an instantaneous luminosity of $2\times10^{33}\,\mathrm{cm^{-2}s^{-1}}$ at $\sqrt{s}=14$ TeV, five times higher than that of the previous detector. With an almost completely new detector, a software-only trigger system has been developed and all track reconstruction algorithms have been redesigned.
The knowledge of the track reconstruction efficiency at different momenta and in different regions of the detector is essential for many analyses, including cross-section and asymmetry measurements. A tag-and-probe method has been developed to estimate the tracking efficiency using muonic tracks from $J/\psi\rightarrow\mu^+\mu^-$ decays, where the probe tracks are reconstructed excluding hits from the tracking subdetectors under scrutiny.
A complementary method is exploited to address tracking-efficiency corrections due to hadronic interactions with the detector material, using pions from $D^0\rightarrow K\pi$ and $D^0\rightarrow K\pi\pi\pi$ decays. In this talk, these data-driven methods and their applications to the data taken in 2023 and 2024 are presented.
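For illustration, the efficiency extracted by such a tag-and-probe method is simply the matched fraction of probe tracks; the sketch below (with made-up counts, not LHCb data) shows the calculation with a binomial uncertainty.

```python
import math

def tag_and_probe_efficiency(n_probes: int, n_matched: int):
    """Tracking efficiency from tag-and-probe counts: the fraction of probe
    tracks (reconstructed without the subdetector under study) that are
    matched to a fully reconstructed track, with a binomial uncertainty."""
    eff = n_matched / n_probes
    err = math.sqrt(eff * (1.0 - eff) / n_probes)
    return eff, err

# Hypothetical counts in one momentum bin:
eff, err = tag_and_probe_efficiency(n_probes=12500, n_matched=12010)
print(f"efficiency = {eff:.3f} +/- {err:.3f}")
```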
CERNBox is an innovative scientific collaboration platform, built solely from open-source components to meet the unique requirements of scientific workflows. Used at CERN for the last decade, the service supports the 35,000 users at CERN and seamlessly integrates with batch farms and Jupyter-based services. Powered by Reva, an open-source HTTP and gRPC server written in Go, CERNBox has demonstrated the provision of Sync&Share capabilities on top of multiple storage systems such as EOS and CephFS, as well as enabling federated sharing with other institutions.
In this contribution, we present the evolution of CERNBox in supporting CephFS, which has been chosen as the storage system to address the Windows applications use-cases at CERN. As we are migrating out of DFS, the legacy Windows storage provided by Microsoft, and commissioning Windows Workspaces powered by CephFS, we show how CERNBox provides a flexible software stack to seamlessly integrate the Windows-based community, which includes the Engineering sector of the Organization.
We conclude by emphasizing the multiple synergies enabled by this approach. On one hand, Windows-based data-centric workflows can leverage the multi-protocol accesses (sync, web, SMB) provided by CERNBox. On the other hand, the widespread adoption of CephFS within the scientific community positions CERNBox as an out-of-the-box solution for implementing a scalable collaborative cloud storage service.
The main reconstruction and simulation software framework of the ATLAS
experiment, Athena, underwent a major change during the LHC Run 3 in the way
the configuration step of its applications is performed. The new configuration
system, called ComponentAccumulator, emphasises modularity and provides a way
for standalone execution of parts of a job, as long as the inputs are
available, which allows unit-testing of individual components or groups of
components, as well as easier debugging.
The switch to the new configuration system of the High-Level Trigger (HLT)
software, which utilises Athena algorithms for object reconstruction and
hypothesis testing, required designing a special approach to prevent disruption
of data taking during the code migration to ComponentAccumulator. An additional
challenge arises from the large number of HLT chains, where in many cases
copies of the same algorithm with varying configurations are used, which
significantly increases the number of configured parameters compared to offline
reconstruction jobs.
This report describes the migration of the HLT software to ComponentAccumulator
along with further improvements in the data acquisition introduced for Run 3
data taking.
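As a generic illustration of the accumulator-style configuration pattern described above (a Python sketch that mimics the idea of self-contained, mergeable, unit-testable fragments; the class and property names are hypothetical and this is not the Athena ComponentAccumulator API):

```python
class ConfigFragment:
    """A self-contained configuration fragment that can be merged into a job:
    it owns its algorithms and the services they need (hypothetical stand-in
    for the accumulator pattern, not the Athena API)."""
    def __init__(self):
        self.algorithms = []
        self.services = {}

    def add_algorithm(self, name, **properties):
        self.algorithms.append((name, properties))

    def add_service(self, name, **properties):
        self.services.setdefault(name, properties)

    def merge(self, other):
        self.algorithms.extend(other.algorithms)
        for name, props in other.services.items():
            self.services.setdefault(name, props)

def tracking_config(pt_threshold_mev=1000.0):
    """One independently testable fragment; several trigger chains can request
    copies of the same algorithm with different properties."""
    cfg = ConfigFragment()
    cfg.add_service("MagneticFieldSvc")
    cfg.add_algorithm("TrackFinder", PtThresholdMeV=pt_threshold_mev)
    return cfg

job = ConfigFragment()
job.merge(tracking_config(pt_threshold_mev=1000.0))   # offline-like setup
job.merge(tracking_config(pt_threshold_mev=4000.0))   # a trigger-chain copy
```

Because each fragment is self-contained, it can be built and inspected in isolation, which is what makes unit tests of individual components possible.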
Data and Metadata Organization, Management and Access
ATLAS is participating in the WLCG Data Challenges, a series of exercises established in 2021 and repeated roughly every two years to prepare for the data rates of the High-Luminosity LHC (HL-LHC). In each challenge, transfer rates are increased to ensure preparedness for the full rates by 2029. The goal of the 2024 Data Challenge (DC24) was to reach 25% of the expected HL-LHC transfer rates, with each experiment deciding how to execute the challenge based on agreed general guidelines and common dates. The ATLAS challenge was designed to test the ATLAS distributed infrastructure across 66 sites and was carried out over 12 days, with increasing rates and more complex transfer topologies, putting significant strain on the system. It was also the first time the new OAuth 2.0 authorization system was tested at such a large scale. This paper will discuss the planning of the challenge, the tools used to execute it, the agreed transfer rates for the individual connections, and how the challenge itself was run. We will then present the results obtained, the goals that were not reached, an analysis of the bottlenecks, and the lessons learned. Finally, we will look ahead to the next challenge, currently scheduled for 2026, with 50% of the HL-LHC rates.
To verify the readiness of the data distribution infrastructure for the HL-LHC, which is planned to start in 2029, WLCG is organizing a series of data challenges with increasing throughput and complexity. This presentation addresses the contribution of CMS to Data Challenge 2024, which aimed to reach 25% of the expected network throughput of the HL-LHC. During the challenge, CMS tested various network flows, from RAW data distribution to the "flexible" model, which adds network traffic resulting from data reprocessing and MC production between most CMS sites.
The overall throughput targets were met on the global scale, utilizing several hundred links. Valuable information was gathered on the scaling capabilities of key central services such as Rucio and FTS. During the challenge, about half of the transferred volume was carried out with token-based authentication. In general, individual links showed sufficient performance and sites coped with the target throughput. For links that did not reach the target, attempts were made to identify the bottleneck, whether in the transfer tools, the network link, the storage systems involved or any other component.
With the highest data recording requirement of all the LHC experiments, ALICE introduced ground-breaking advances in data processing and storage and presented the CERN IT data centre with new challenges. For these reasons, the EOS O2 storage system was designed to be cost-efficient, highly redundant and to maximise data resilience, keeping data accessible even in the event of unexpected disruptions or hardware failures. With 150 PB of usable storage space, EOS O2 is now the largest disk storage system in use at CERN. We will report on our experience and the effectiveness of operating this full production system in Run 3, including the LHC heavy-ion run, and on how this will help pave the road towards the data deluge coming with the High-Luminosity LHC. In particular, we will report on our experience with RS(10+2) erasure coding in production, the achievable performance of EOS O2, reliability figures, life-cycle management, capacity extension and rebalancing operations.
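For orientation, RS(10+2) erasure coding stripes each file over ten data and two parity chunks, so any two chunks of a stripe may be lost while the raw-to-usable overhead stays at 1.2; the numbers below are simple arithmetic under that assumption, not measured EOS O2 figures.

```python
def rs_overhead(data_chunks: int, parity_chunks: int, usable_pb: float):
    """Raw capacity required for a given usable capacity under Reed-Solomon
    RS(data+parity) erasure coding, and the chunk losses tolerated per stripe."""
    overhead = (data_chunks + parity_chunks) / data_chunks
    return usable_pb * overhead, parity_chunks

raw_pb, tolerated = rs_overhead(10, 2, usable_pb=150.0)
print(f"RS(10+2): {raw_pb:.0f} PB raw for 150 PB usable, "
      f"tolerating {tolerated} lost chunks per stripe")   # 180 PB raw, if all space is RS(10+2)
```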
High-Energy Physics (HEP) experiments rely on complex, global networks to interconnect collaborating sites, data centers, and scientific instruments. Managing these networks for data-intensive scientific projects presents significant challenges because of the ever-increasing volume of data transferred, diverse project requirements with varying quality of service needs, multi-domain infrastructure, WAN distances, and limited visibility into network traffic flows. This lack of visibility hinders network operators' ability to understand actual user behavior across different network segments, optimize performance, undertake effective traffic engineering and shaping, and effectively debug and troubleshoot issues.
This project addresses these challenges by focusing on improving network visibility through standardized packet marking and flow labeling techniques. We present the Scitags initiative, a collaborative effort formed within the Research Networking Technical Working Group (RNTWG) in 2020. Scitags aims to develop a generic framework and standards for identifying the owner and associated scientific activity of network traffic. This framework extends beyond the HEP/WLCG experiments and has the potential to benefit all global communities using Research and Education (R&E) networks.
The presentation will detail the current state of the Scitags initiative, including the evolving framework, the underlying technologies being explored (e.g. eBPF and IPv6 features such as flow labels and Hop-by-Hop extension headers), and the roadmap for production deployment within R&E networks. By enabling improved network visibility, Scitags will empower network operators to optimize performance, troubleshoot issues more effectively, and ultimately support the growing needs of data-intensive scientific collaborations.
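To convey the flow-labeling idea, the sketch below packs an owner (experiment) and activity identifier into the 20-bit IPv6 flow label. The bit allocation used here is an illustrative assumption, not the Scitags specification, and a real implementation marks packets in the network stack (e.g. via eBPF) rather than in application code.

```python
# Illustrative packing of owner/activity identifiers into a 20-bit IPv6 flow
# label. The split below (9 bits experiment, 6 bits activity, 5 bits entropy)
# is an assumption for this sketch, not the Scitags specification.
EXP_BITS, ACT_BITS, ENTROPY_BITS = 9, 6, 5

def pack_flow_label(experiment_id: int, activity_id: int, entropy: int) -> int:
    assert experiment_id < (1 << EXP_BITS)
    assert activity_id < (1 << ACT_BITS)
    assert entropy < (1 << ENTROPY_BITS)
    return ((experiment_id << (ACT_BITS + ENTROPY_BITS))
            | (activity_id << ENTROPY_BITS)
            | entropy)

def unpack_flow_label(label: int):
    entropy = label & ((1 << ENTROPY_BITS) - 1)
    activity = (label >> ENTROPY_BITS) & ((1 << ACT_BITS) - 1)
    experiment = label >> (ACT_BITS + ENTROPY_BITS)
    return experiment, activity, entropy

label = pack_flow_label(experiment_id=5, activity_id=3, entropy=17)
print(unpack_flow_label(label))   # (5, 3, 17)
```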
To address the needs of forthcoming projects such as the Square Kilometre Array (SKA) and the HL-LHC, there is a critical demand for data transfer nodes (DTNs) to realise O(100) Gb/s of data movement. This high throughput can be attained through combinations of increased concurrency of transfers and improvements in the speed of individual transfers. At the Rutherford Appleton Laboratory (RAL), the UK's Tier-1 centre for the Worldwide LHC Computing Grid and initial site for the UK SKA Regional Centre (SRC), we have provisioned 100GbE XRootD servers in preparation for SKA development and operations. This presentation details the efforts undertaken to reach 100 Gb/s data ingress and egress rates using the WebDAV protocol through XRootD endpoints, including the use of a novel XRootD plug-in designed to assess XRootD performance independently of the physical storage backend. Results are also presented for transfer tests against a CephFS storage backend under different configuration settings (e.g. via tunings to file layouts). We discuss the challenges encountered, bottlenecks identified, and insights gained, along with a description of the most effective solutions developed to date and areas of future activity.
To address the need for high transfer throughput for projects such as the LHC experiments, including the upcoming HL-LHC, it is important to make optimal and sustainable use of our available capacity. Load balancing algorithms play a crucial role in distributing incoming network traffic across multiple servers, ensuring optimal resource utilization, preventing server overload, and enhancing performance and reliability. At the Rutherford Appleton Laboratory (RAL), the UK's Tier-1 centre for the Worldwide LHC Computing Grid (WLCG), we started with DNS round-robin and then moved to XRootD's cluster management service component, which uses an active load-balancing algorithm to distribute traffic across 26 servers, but we encountered its limitations when the system as a whole is under heavy load. We describe our tuning of the existing algorithm's configuration before proposing a new tunable, dynamic load balancer based on a weighted random selection algorithm.
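The idea of the proposed load balancer can be sketched as follows: each server is assigned a weight and is picked with probability proportional to it. The weights and host names below are placeholders; in the real system they would be derived from live server metrics.

```python
import random

def pick_server(weights: dict) -> str:
    """Weighted random selection: each server is chosen with probability
    proportional to its weight, so lightly loaded servers receive more new
    connections without starving the rest."""
    servers = list(weights)
    return random.choices(servers, weights=[weights[s] for s in servers], k=1)[0]

# Hypothetical weights, e.g. derived from free capacity or recent load:
weights = {"xrootd01": 0.5, "xrootd02": 0.3, "xrootd03": 0.2}
counts = {s: 0 for s in weights}
for _ in range(10000):
    counts[pick_server(weights)] += 1
print(counts)   # roughly 5000 / 3000 / 2000
```

Because the choice is random rather than strictly "least loaded", short load spikes do not cause all new connections to pile onto a single server.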
Online and real-time computing
The CBM experiment, currently being constructed at GSI/FAIR, aims to investigate QCD at high baryon densities. The CBM First-level Event Selector (FLES) serves as the central event selection system of the experiment. It functions as a high-performance computer cluster tasked with the online analysis of physics data, including full event reconstruction, at an incoming data rate which exceeds 1 TByte/s.
The CBM detector systems operate in a free-running and self-triggered manner, delivering time-stamped data streams. Without inherent event separation, timeslice building replaces global event building. The FLES HPC system integrates data from around 5000 input links into self-contained, overlapping processing intervals and distributes these to the compute nodes.
Using a combination of RDMA and zero-copy techniques, timeslices can be built efficiently over a high-throughput InfiniBand network and distributed to available online computing resources for a full online event reconstruction and analysis in a heterogeneous HPC cluster system. A new IPC online interface to timeslice data utilizes POSIX shared memory governed by a reference-counting item distributor. This design combines maximum performance and flexibility with minimum memory consumption. These new developments have already been successfully field-tested in production at the CBM predecessor experiment mCBM at the GSI/FAIR SIS18.
This work is supported by BMBF (05P21RFFC1).
The High-Luminosity Large Hadron Collider (HL-LHC), scheduled to start
operating in 2029, aims to increase the instantaneous luminosity by a factor of
10 compared to the LHC. To match this increase, the ATLAS experiment has been
implementing a major upgrade program divided into two phases. The first phase
(Phase-I), completed in 2022, introduced new trigger and detector systems that
have been used during the Run 3 data taking period which began in July 2022.
These systems have been used in conjunction with the new Data Acquisition (DAQ)
Readout system, based on a software application called Software Readout Driver
(SW ROD). SW ROD receives and aggregates data from the front-end electronics
via the Front-End Link eXchange (FELIX) system and passes aggregated data
fragments to the High-Level Trigger (HLT) system. During Run 3, SW ROD operates
in parallel with the legacy Readout System (ROS) at an input rate of 100 kHz.
For the Phase-II, the legacy ROS will be completely replaced with a new system
based on the next generation of FELIX and an evolution of the SW ROD
application called Data Handler. Data Handler has the same functional
requirements as SW ROD but must be able to operate at an input rate of 1 MHz.
To facilitate this evolution the SW ROD has been implemented using a plugin
architecture.
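The plugin approach can be illustrated with a small registry sketch: a stable readout interface plus concrete handlers selected by name. This is a generic Python illustration of the pattern only; the actual SW ROD is a C++ application and the names below are hypothetical.

```python
# Minimal plugin-registry sketch: a stable readout interface plus concrete
# handlers registered by name. Names are hypothetical, not SW ROD code.
from abc import ABC, abstractmethod

_REGISTRY = {}

def register(name):
    def wrap(cls):
        _REGISTRY[name] = cls
        return cls
    return wrap

class ReadoutHandler(ABC):
    @abstractmethod
    def process(self, fragment: bytes) -> bytes: ...

@register("run3_swrod")
class SwRodHandler(ReadoutHandler):
    def process(self, fragment: bytes) -> bytes:
        return fragment          # placeholder: aggregation at the 100 kHz rate

@register("phase2_datahandler")
class DataHandler(ReadoutHandler):
    def process(self, fragment: bytes) -> bytes:
        return fragment          # same interface, 1 MHz target (placeholder)

def make_handler(name: str) -> ReadoutHandler:
    return _REGISTRY[name]()

handler = make_handler("phase2_datahandler")
```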
This contribution presents the design and implementation of the SW ROD
application for Run 3, along with the strategy for its evolution to the
Phase-II Readout system. It discusses the lessons learned during Run 3 and
describes the challenges that have been addressed to accomplish the demanding
performance requirements of HL-LHC.
The data acquisition (DAQ) system stands as an essential component within the CMS experiment at CERN. It relies on a large network system of computers with demanding requirements on control, monitoring, configuration and high throughput communication. Furthermore, the DAQ system must accommodate various application scenarios, such as interfacing with external systems, accessing custom electronics devices for data readout, and event building. We present a versatile and highly modular programmable C++ framework designed for crafting applications tailored to various needs, facilitating development through the composition and integration of modules to achieve the desired DAQ capabilities. This framework takes advantage of reusable components and readily available off-the-shelf technologies. Applications are structured to seamlessly integrate into a containerized ecosystem, where the hierarchy of components and their aggregation is specified to form the final deployable unit to be used across multiple computers or nodes within an orchestrating environment. The utilization of the framework, along with the containerization of applications, enables coping with the complexity of implementing the CMS DAQ system by providing standardized structures and components to achieve a uniform and consistent architecture.
The CBM First-level Event Selector (FLES) serves as the central data processing and event selection system for the upcoming CBM experiment at FAIR. Designed as a scalable high-performance computing cluster, it facilitates online analysis of unfiltered physics data at rates surpassing 1 TByte/s.
As the input to the FLES, the CBM detector subsystems deliver free-streaming, self-triggered data to the common readout interface (CRI), which is a custom FPGA PCIe board installed in the FLES entry nodes. A subsystem-specific part of the FPGA design time-partitions the input streams into context-free packages. The FLES interface module (FLIM), a component of the FPGA design, acts as the interface between the subsystem-specific readout logic and the generic FLES data distribution. It transfers the packed detector data to the host's memory using a low-latency, high-throughput PCIe DMA engine. This custom design enables a shared-memory-based, true zero-copy data flow.
A fully implemented FLIM for the CRI board is currently in use within CBM test setups and the FAIR Phase-0 experiment mCBM. We present an overview of the FLES input interface architecture and provide performance evaluations under synthetic as well as real-world conditions.
This work is supported by BMBF (05P21RFFC1).
The ATLAS experiment at the Large Hadron Collider (LHC) at CERN continuously
evolves its Trigger and Data Acquisition (TDAQ) system to meet the challenges
of new physics goals and technological advancements. As ATLAS prepares for the
Phase-II Run 4 of the LHC, significant enhancements in the TDAQ Controls and
Configuration tools have been designed to ensure efficient data collection,
processing, and management. This abstract presents the evolution of the ATLAS TDAQ
Controls and Configuration system leading up to Phase-II Run 4. As part of the
evolution towards Phase-II, Kubernetes has been chosen to orchestrate the Event
Filter farm. By leveraging Kubernetes, ATLAS can dynamically allocate computing
resources, scale processing capacity in response to changing data taking
conditions, and ensure high availability of data processing services. The
integration of Kubernetes with the TDAQ Run Control framework enables
tight synchronisation between the experiment's data acquisition components
and the computing infrastructure. We will discuss the architectural
considerations and implementation challenges involved in Kubernetes integration
with the ATLAS TDAQ controls and configuration system. We will highlight the
benefits of using Kubernetes as an event filter farm orchestrator, including
improved resource utilization, enhanced fault tolerance, and simplified
deployment and management of data processing workflows. In addition, we will
report on the extensive testing of Kubernetes that was conducted using a farm
of 2500 servers within the experiment data taking environment, demonstrating
its scalability and robustness in handling the demands of the ATLAS TDAQ system
for Phase-II. The adoption of Kubernetes represents a significant step forward
in the evolution of ATLAS TDAQ controls and configuration system, aligning with
industry best practices in container orchestration and cloud-native computing.
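As a flavour of what orchestration of the Event Filter farm with Kubernetes looks like programmatically, the sketch below scales a deployment with the Kubernetes Python client. The deployment and namespace names are hypothetical, and in the real system such actions are driven by the TDAQ Run Control integration rather than a standalone script.

```python
# Hedged sketch: scaling an Event Filter deployment with the Kubernetes Python
# client. Deployment and namespace names are hypothetical placeholders.
from kubernetes import client, config

def scale_event_filter(replicas: int,
                       name: str = "ef-processing",
                       namespace: str = "atlas-ef") -> None:
    config.load_kube_config()                      # or load_incluster_config()
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name, namespace, body={"spec": {"replicas": replicas}})

# e.g. ramp processing capacity up when a run starts:
scale_event_filter(replicas=2500)
```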
The DarkSide-20k detector is now under construction in the Gran Sasso National Laboratory (LNGS) in Italy, the largest underground physics facility. It is designed to directly detect dark matter by observing weakly interacting massive particles (WIMPs) scattering off the nuclei in 20 tonnes of underground-sourced liquid argon in a dual-phase time projection chamber (TPC). Additionally, two layers of veto detectors allow operating with virtually zero instrumental background in the region of interest, leaving only irreducible neutrino interactions. Once operating, the DarkSide-20k experiment is expected to lead the field of high-mass WIMP searches in the next decade and, due to the low background, will have high discovery potential. Thanks to its size and sensitivity, the detector will allow a broad physics program including supernova neutrino detection.
The light generated during the interactions in the liquid argon is detected by custom silicon photomultiplier (SiPM) assemblies of size 20 cm by 20 cm. The units installed in the veto detectors are equipped with application-specific integrated circuits (ASICs) coupled to the SiPMs, allowing a linear signal response up to 100 photons and a signal-to-noise ratio of 6 for a single photon, while those for the TPC employ a discrete-element front-end with similar performance.
The data acquisition (DAQ) system for the DarkSide-20k experiment is designed to acquire signals from the 2720 channels of these photosensors in a triggerless mode. The data rate from the TPC alone is expected to be at the level of 2.5 GB/s and will be acquired by 36 newly available commercial VX2745 CAEN 16-bit, 125 MS/s, high-channel-density (64 ch.) waveform digitizers. The veto detectors are read out by an additional 12 modules. The data is first transferred to 24 Frontend Processor machines for filtering and reduction. The data stream is then received by a set of Time Slice Processor computers, where the whole detector data is assembled into fixed-length time series, analysed and stored for offline use. These operations will be supervised by the Maximum Integrated Data Acquisition System (MIDAS), developed at the Paul Scherrer Institute in Switzerland and the TRIUMF laboratory in Canada.
Offline Computing
With the increasing amount of optimized and specialized hardware such as GPUs and ML cores, HEP applications face both the opportunity and the challenge of taking advantage of these resources, which are becoming more widely available at scientific computing sites. The Heterogeneous Frameworks project aims at evaluating new methods and tools for the support of both heterogeneous computational nodes and multi-node workloads. Based on the experience from the parallel frameworks of the LHC experiments and their ad-hoc support for heterogeneous resources, this project investigates newer libraries and languages that have been developed since the move to parallel frameworks about a decade ago.
This paper will summarize the scope of the problem being tackled, the state of the art of heterogeneous libraries, and the benchmark infrastructure used for the R&D activities. We will also present some of the tooling developed to extract the benchmark scenarios from existing LHC experiment workflows. First results of using both newer C++ and Julia libraries for parallel execution will be shown.
The large increase in luminosity expected from Run 4 of the LHC presents the ATLAS experiment with a new scale of computing challenge, and we can no longer restrict our computing to CPUs in a High Throughput Computing paradigm. We must make full use of the High Performance Computing resources available to us, exploiting accelerators and making efficient use of large jobs over many nodes.
Here we show our current developments in introducing these capabilities to Athena, ATLAS’s general software framework. We will show how we have used MPI to distribute processing over multiple nodes, and how this can be used to run real ATLAS jobs from the Grid on an HPC. We will also show how we have integrated a first-class capability to offload work to an accelerator without blocking the CPU, by making use of suspendable lightweight threads, and an example of how this capability can be used in a real workload.
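The multi-node pattern can be sketched with mpi4py: each rank processes an interleaved subset of the events and partial results are reduced to rank 0. This is a generic illustration with a placeholder work function, not the Athena implementation.

```python
# Minimal MPI event-distribution sketch (mpi4py), not the Athena implementation:
# each rank takes every size-th event starting from its own rank number.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N_EVENTS = 100000                      # placeholder total

def process_event(i: int) -> float:
    return float(i)                    # placeholder for real reconstruction work

local_sum = sum(process_event(i) for i in range(rank, N_EVENTS, size))
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print(f"processed {N_EVENTS} events on {size} ranks, checksum {total}")
```

Launched with, e.g., `mpirun -n 8 python job.py`, the same script spans as many nodes as the HPC scheduler provides.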
To achieve better computational efficiency and exploit a wider range of computing resources, the CMS software framework (CMSSW) has been extended to offload part of the physics reconstruction to NVIDIA GPUs. To support additional back-ends, as well as to avoid the need to write, validate and maintain a separate implementation of the reconstruction algorithms for each back-end, CMS has adopted the Alpaka performance portability library.
Alpaka (Abstraction Library for Parallel Kernel Acceleration) is a header-only C++ library that provides performance portability across different back-ends, abstracting the underlying levels of parallelism. It supports serial and parallel execution on CPUs, and extremely parallel execution on NVIDIA, AMD and Intel GPUs.
This contribution will show how Alpaka is used in the CMS software to develop and maintain a single code base; to use different toolchains to build the code for each supported back-end, and link them into a single application; to seamlessly select the best back-end at runtime, and implement portable reconstruction algorithms that run efficiently on CPUs and GPUs from different vendors. It will describe the validation and deployment of the Alpaka-based implementation in the CMS High Level Trigger, and highlight how it achieves near-native performance.
As the Large Hadron Collider progresses through Run 3, the LHCb experiment has made significant strides in upgrading its offline analysis framework and associated tools to efficiently handle the increasing volumes of data generated. Numerous specialised algorithms have been developed for offline analysis, with a central innovation being FunTuple, a newly developed component designed to effectively compute and store offline data. Built upon the robust Gaudi functional framework, FunTuple merges a user-friendly Python interface with a flexible templated design. This modern architecture supports a wide range of data types, including both reconstructed and simulated events, facilitating processing of event-level and decay-level information. Crucially, FunTuple is primed for future enhancements to integrate new event models, optimising vectorised data processing across heterogeneous resources.
A pivotal feature of FunTuple is its capability to align trigger-computed observables with those analysed offline, crucial for maintaining data integrity across LHCb analyses. This alignment is achieved through Throughput Oriented (ThOr) functors, specifically crafted to meet the high throughput demands of the trigger system. Moreover, FunTuple offers comprehensive customisation options, enabling users to define and store tailored observables within ROOT files in anticipation of future increases in data volumes. FunTuple has undergone rigorous testing, including numerous unit tests and pytest evaluations. In 2024, it is undergoing a comprehensive stress test by hundreds of analysts to validate its reliability in managing and validating the quality of data recorded by LHCb.
This presentation will delve into the design, user interface, and integration of FunTuple alongside other analysis components, showcasing their efficiency and reliability through detailed performance metrics in managing large-scale data.
We summarize the status of the Deep Underground Neutrino Experiment (DUNE) software and computing development. The DUNE Collaboration has been successfully operating the DUNE prototype detectors at both Fermilab and CERN, and testing offline computing services, software, and infrastructure using the data collected. We give an overview of results from end-to-end testing of systems needed to acquire, catalog, reconstruct, simulate and analyze the beam data from ProtoDUNE Horizontal Drift (PDHD) and Near Detector 2x2 Demonstrator, and cosmic data from ProtoDUNE Vertical Drift (PDVD). These tests included reconstruction and simulation of data from all prototype detector runs utilizing a variety of distributed computing and HPC resources. The results of these studies help define the development path of DUNE core software and computing to support the physics goals of precision measurements of neutrino oscillation parameters, detection of astrophysical neutrinos, measurement of neutrino interaction properties and searches for physics beyond the Standard Model. The data from the full DUNE far and near detectors, expected in 2029 and 2031 respectively, will present significant challenges in terms of data product memory management, optimized use of parallel processing for reconstruction and simulation, and management of large individual trigger data volumes. DUNE will present plans for future development to accommodate the requirements of the larger DUNE far and near detectors, and the timeline for future data challenges leading to data taking at the end of the decade.
Since the mid-2010s, the ALICE experiment at CERN has seen significant changes in its software, especially with the introduction of the Online-Offline (O²) computing system during Long Shutdown 2. This evolution required continuous adaptation of the Quality Control (QC) framework responsible for online Data Quality Monitoring (DQM) and offline Quality Assurance (QA).
After a general overview of the system, this talk delves into the evolving user requirements that shaped the QC framework from its initial prototyping phase to its current state. We will explore the changing landscape of performance needs and feature demands, highlighting which initial requirements persisted, which emerged later, and which features ultimately proved unnecessary.
Additionally, we will trace the framework's development in relation to other software components within the ALICE ecosystem, offering valuable insights and lessons learned throughout the process. Finally, we will also discuss the challenges encountered in balancing development team resources with the evolving project scope.
Simulation and analysis tools
The ATLAS Fast Chain represents a significant advancement in streamlining Monte Carlo (MC) production efficiency, specifically for the High-Luminosity Large Hadron Collider (HL-LHC). This project aims to simplify the production of Analysis Object Data (AODs) and potentially Derived Analysis Object Data (DAODs) from generated events with a single transform, facilitating rapid reproduction of the entire MC dataset multiple times per year. By eliminating intermediate formats and optimizing CPU utilization, the Fast Chain offers substantial savings in disk space while staying within the CPU budget by employing fast simulation methodologies instead of full MC campaigns. Central to the success of the Fast Chain is the seamless integration of fast simulation and reconstruction techniques. Leveraging AtlFast3 methodologies for efficient calorimeter shower simulation and employing Fast Track Simulation (FATRAS) for charged particles in the Inner Detector, the project aims at accelerated processing without compromising accuracy. Notably, muon simulations rely on Geant4 due to minimal CPU overhead. Pileup effects are incorporated through MC overlay, with potential future integration of data overlay. Reconstruction speed optimization focuses on Inner Detector track reconstruction. Strategies such as dedicated reconstruction configurations and track overlay from pre-mixed pileup datasets are being explored. In summary, the ATLAS Fast Chain project demonstrates a paradigm shift in MC production methodologies, offering a scalable and efficient solution tailored to the demands of the HL-LHC era. This abstract provides an overview of the project's objectives, methodologies, and ongoing developments, showcasing its potential to revolutionize MC production within the ATLAS experiment.
Simulation of physics processes and detector response is a vital part of high energy physics research but also represents a large fraction of the computing cost. Generative machine learning is successfully complementing full (standard, Geant4-based) simulation as part of fast simulation setups, improving performance compared to classical approaches.
A lot of attention has been given to calorimeters, the slowest part of the full simulation, but their simulation speed becomes comparable to that of silicon semiconductor detectors once fast simulation is used. This makes silicon detectors the next candidate to speed up, especially given the growing number of channels in future detectors.
This work studies the use of transformer architectures for fast simulation of silicon tracking detectors. The OpenDataDetector is used as a benchmark detector. Physics performance is estimated by comparing tracks reconstructed with the ACTS tracking framework for full simulation and for the machine-learning-based simulation.
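A schematic PyTorch sketch of the kind of model involved is shown below: a transformer encoder turns track parameters into a fixed number of predicted hit positions. The dimensions, the learned per-layer queries and the random inputs are assumptions for illustration, not the architecture used in this study.

```python
# Schematic sketch (toy dimensions, random inputs) of a transformer that maps
# track parameters to silicon hit positions; not the model of this study.
import torch
import torch.nn as nn

class HitSimulator(nn.Module):
    def __init__(self, d_model=64, n_layers=3, n_hits=12):
        super().__init__()
        self.track_embed = nn.Linear(5, d_model)        # (pT, eta, phi, d0, z0)
        self.queries = nn.Parameter(torch.randn(n_hits, d_model))  # one per layer
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 3)               # local hit coordinates

    def forward(self, params):                          # params: (batch, 5)
        ctx = self.track_embed(params).unsqueeze(1)     # (batch, 1, d_model)
        seq = torch.cat([ctx, self.queries.expand(params.size(0), -1, -1)], dim=1)
        out = self.encoder(seq)[:, 1:, :]               # drop the context token
        return self.head(out)                           # (batch, n_hits, 3)

model = HitSimulator()
hits = model(torch.randn(8, 5))                         # 8 toy tracks
print(hits.shape)                                       # torch.Size([8, 12, 3])
```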
Celeritas is a rapidly developing GPU-enabled detector simulation code aimed at accelerating the most computationally intensive problems in high energy physics. This presentation will highlight exciting new performance results for complex subdetectors from the CMS and ATLAS experiments using EM secondaries from hadronic interactions. The performance will be compared on both Nvidia and AMD GPUs as well as multicore CPUs, made possible by a new native Celeritas geometry representation of Geant4 geometry objects. This new surface-based geometry, ORANGE, provides a robust and efficient navigation engine fundamentally different from existing detector simulation models. Finally, we introduce two new physics capabilities to Celeritas, optical photon tracking and extended EM models, that demonstrate the code's extensibility and promise potential applications beyond LHC detectors.
An important alternative for boosting the throughput of simulation applications is to take advantage of accelerator hardware, by making general particle transport simulation for high-energy physics (HEP) single-instruction-multiple-thread (SIMT) friendly. This challenge is not yet resolved due to difficulties in mapping the complexity of Geant4 components and workflow to the massive parallelism features exposed by graphics processing units (GPU). The AdePT project is one of the R&D initiatives tackling this limitation and exploring GPUs as potential accelerators for offloading part of the CPU simulation workload. Our main target is the implementation of a complete electromagnetic shower transport engine working on the GPU. A first development phase allowed us to verify our GPU prototype against the Geant4 simulation for both simplified and complex setups, and to test different Geant4 integration strategies. We have simplified the integration procedure of AdePT as an external library in both standalone applications and experimental frameworks through standard Geant4 mechanisms. The project's current main focus is to provide solutions for the main performance bottlenecks identified so far: inefficient geometry modeling for the GPUs, and a suboptimal CPU-GPU scheduling strategy. We will present the most recent results and conclusions of our work, focusing on the hybrid Geant4-AdePT use case.
The demands for Monte-Carlo simulation are drastically increasing with the high-luminosity upgrade of the Large Hadron Collider, and expected to exceed the currently available compute resources. At the same time, modern high-performance computing has adopted powerful hardware accelerators, particularly GPUs. AdePT is one of the projects aiming to address the demanding computational needs by leveraging these heterogeneous compute architectures. While AdePT has successfully ported realistic detector simulations to GPUs using the VecGeom library, the complexity of geometry modeling emerged as a bottleneck. Thread divergence and high register usage were impeding the GPU performance. Therefore, a new, GPU-friendly surface-based model has been introduced in the VecGeom library that decomposes the divergent code of the 3D primitive solids into simpler and more balanced surface algorithms. In this work, we present the latest performance results, in particular on complex setups like the CMS Phase-2 geometry. Additionally, we explore techniques such as mixed precision and bounding volume hierarchies to further accelerate simulations.
Opticks is an open source project that accelerates optical photon simulation
by integrating NVIDIA GPU ray tracing, accessed via the NVIDIA OptiX API, with
Geant4 toolkit based simulations.
Optical photon simulation times of 14 seconds per 100 million photons
have been measured within a fully analytic JUNO GPU geometry
auto-translated from the Geant4 geometry when using a single NVIDIA GPU from
the first RTX generation.
Optical physics processes of scattering, absorption, scintillator reemission
and boundary processes are implemented in CUDA based on Geant4. Wavelength-dependent material and surface
properties as well as inverse cumulative distribution functions for reemission
are interleaved into GPU textures providing fast interpolated property lookup
or wavelength generation. In this work we describe the application of Opticks
to JUNO simulation including new Opticks features that improve performance for
complex CSG shapes and torus solids.
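The inverse-CDF lookup used for reemission wavelength generation can be illustrated on the CPU with NumPy: a toy emission spectrum is converted to a cumulative distribution and sampled by interpolation, which is what the GPU texture lookup provides in hardware. The spectrum below is an arbitrary example, not JUNO data.

```python
# Inverse-transform sampling of reemission wavelengths via an interpolated
# inverse CDF, mimicking on CPU what a GPU texture lookup provides.
import numpy as np

wavelength_nm = np.linspace(350.0, 550.0, 201)
intensity = np.exp(-0.5 * ((wavelength_nm - 430.0) / 25.0) ** 2)  # toy spectrum

cdf = np.cumsum(intensity)
cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])              # normalise to [0, 1]

def sample_wavelengths(n, rng=np.random.default_rng(0)):
    u = rng.random(n)
    return np.interp(u, cdf, wavelength_nm)            # interpolated inverse CDF

print(sample_wavelengths(5))
```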
Collaborative software and maintainability
The LHCb Software Framework Gaudi has been developed in C++ since 1998. Over the years it evolved following the changes in the C++ established best practices and the evolution of the C++ standard, even reaching the point of enabling the development of multi-threaded applications.
In the past few years there have been several announcements and debates over the so-called C++ successor languages and safe alternatives to C++, with Rust leading the way as an example of a safe and performant language that can replace C and C++ in a number of cases.
This paper explores some ways Rust can be used to extend the Gaudi software framework, focusing on how one can leverage the Rust-C++ interoperability efforts driven by the community. We show how to invoke Rust code from C++ and vice versa, and how Gaudi components could be written entirely in Rust. The experience gained in this exercise can be used to evaluate possible integration with other languages or technologies, such as WASM.
Recently, interest in measuring and improving the energy (and carbon) efficiency of computation in HEP, and elsewhere, has grown significantly. Measurements have been, and continue to be, made of the efficiency of various computational architectures in standardised benchmarks... but those benchmarks tend to compare only implementations in single programming languages. Similarly, comparisons of the efficiency of various languages tend to focus on a single architecture, although it is the case that some abstractions in a given language can match specific architectural choices (in, say, memory ordering strictness) better than others.
The existence of the JetReconstruction.jl project, implementing a subset of the FastJet C++ code's functionality in performant Julia, allows us to usefully compare how the relative efficiencies of implementations in the two languages are influenced by the architecture they are executed on.
We report on the results of comparing benchmarks on these codes, and others, on x86 and various aarch64 implementations, amongst others.
ROOT is a software toolkit at the core of LHC experiments and HENP collaborations worldwide, widely used by the community and in continuous development with it. The package is available through many channels that cater to different types of users with different needs. This ranges from software releases on the LCG stacks provided via CVMFS, from which all HENP users benefit, to pre-built binaries available on the three major platforms (Linux, macOS, Windows), to more specialised packaging systems such as Homebrew, Snap and Anaconda. The last example is one of the main systems to distribute software to a Python user base, and is particularly beneficial for complex environments with real-world scientific applications in mind, such as those found in HENP. Nonetheless, the standard Python implementation defaults to using pip as a package installer. This technology, together with the Python Package Index (PyPI), distributes many Python packages and has the advantage of providing a lightweight path to downstream development of a package with some upstream Python dependencies. This contribution highlights the steps required towards making "pip install ROOT"
possible, demonstrating its availability as an early-stage release, and discussing some of the unique challenges of delivering a highly-performant multi-language software via the standard Python packaging system.
In the vast landscape of CERN's internal documentation, finding and accessing relevant detailed information remains a complex and time-consuming task. To address this challenge, the AccGPT project proposes the development of an intelligent chatbot leveraging Natural Language Processing (NLP) technologies. The primary objective is to harness open-source Large Language Models (LLMs) to create a purpose-built chatbot for text knowledge retrieval, with the potential to serve as an assistant for code development and other features in the future.
This initiative was driven by the growing demand at CERN for access to LLMs, not only for building AI Chatbots but also for various other use cases, including Transcription and Translation as a Service (TTaaS), CDS and Zenodo Information Categorization, HR selection processes, and many others. Providing easy and efficient access to LLMs is crucial for the adoption of Generative AI across numerous processes at CERN.
A promising first prototype has already been developed in the realm of knowledge retrieval. It demonstrates a sufficient understanding of user inquiries and provides comprehensive responses utilizing a Retrieval Augmented Generation (RAG) pipeline. However, there is room for improvement to further increase the precision of the responses, which can be achieved by enhancing the retrieval pipeline, considering more powerful and larger LLMs, or fine-tuning the LLMs with more relevant scientific data.
The user interface design and overall user experience of the current prototype chatbot are being iteratively improved, and preparations are underway to make AccGPT available to the community for testing. Automated data scraping and preprocessing pipelines are also being developed to update the chatbot's knowledge base fully autonomously.
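The retrieval step of such a RAG pipeline can be sketched as follows. The embed() function is a stand-in for a real embedding model, and the documents and question are placeholders, not CERN documentation; a production pipeline would use a vector store and the chosen LLM behind the prompt.

```python
# Generic RAG retrieval sketch; embed() stands in for a real embedding model
# and the documents are placeholders, not CERN documentation.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function; a real pipeline would call an
    embedding model and cache document vectors in a vector store."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

documents = [
    "How to request a new accelerator control account ...",
    "Transcription and Translation as a Service user guide ...",
    "Steps to archive a record in CDS ...",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(question: str, k: int = 2):
    scores = doc_vectors @ embed(question)          # cosine similarity (unit vectors)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n".join(retrieve("How do I use the translation service?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
```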
The LHCb collaboration continues to primarily utilize the Run 1 and Run 2 legacy datasets well into Run 3. As the operational focus shifts from the legacy data to the live Run 3 samples, it is vital that a sustainable and efficient system is in place to allow analysts to continue to profit from the legacy datasets. The LHCb Stripping project is the user-facing offline data-processing stage that allows analysts to select their physics candidates of interest simply using a Python-configurable architecture. After physics selections have been made and validated, the full legacy datasets are then reprocessed in small time windows known as Stripping campaigns.
Stripping campaigns at LHCb are characterized by a short development window with a large portion of collaborators, often junior researchers, directly developing a wide variety of physics selections; the most recent campaign dealt with over 900 physics selections. Modern organizational tools, such as GitLab Milestones, are used to track all of the developments and ensure the tight schedule is adhered to by all developers across the physics working groups. Additionally, continuous integration is implemented within GitLab to run functional tests of the physics selections, monitoring rates and timing of the different algorithms to ensure operational conformity. Outside of these large campaigns the project is also subject to nightly builds, ensuring the maintainability of the software when parallel developments are happening elsewhere.
I will be presenting the history of the design, implementation, testing, and release of the production version of a C++-based software for the Gas Gain Stabilization System (GGSS) used in the TRT detector at the ATLAS experiment. This system operates 24/7 in the CERN Point1 environment under the control of the Detector Control System (DCS) and plays a crucial role in delivering reliable data during the LHC’s stable beams.
The uniqueness of this software lies in its initial release around 2004, followed by subsequent refactoring, improvements, and implementation for the Run1 period of the LHC in 2008. Another significant change occurred during Long Shutdown 1 when the operating system transitioned from Windows to Linux for Run2 in 2015. More recently, there have been frequent updates and upgrades to the operating system and external libraries.
My aim is to present the evolution of the software, highlighting changes introduced from an external perspective due to shifts in the environment or requirements. Additionally, I’ll discuss the evolution of the C++ standard, compiler changes, security considerations, and modifications to the build and test environment. During the conference, I will focus on the most compelling and significant milestones, as well as key aspects relevant to the lifecycle of this software.
Computing Infrastructure
A robust computing infrastructure is essential for the success of scientific collaborations. However, smaller or newly founded collaborations often lack the resources to establish and maintain such an infrastructure, resulting in a fragmented analysis environment with varying solutions for different members. This fragmentation can lead to inefficiencies, hinder reproducibility, and create challenges for the collaboration.
We present an analysis facility for the DARWIN (DARk matter WImp search with liquid xenoN) observatory, a new experiment that is currently in its R&D phase. The facility is designed to be lightweight with minimal administrative overhead while providing a common entry point for all DARWIN collaboration members. The setup serves as a blueprint for other collaborations that want to provide a common analysis facility for their members. Grid computing and storage resources are integrated into the facility, allowing for distributed computing and providing a common entry point for storage. The authentication and authorization infrastructure for all services is token-based, using an Indigo IAM instance.
This talk will discuss the architecture of the facility, its provided services, first experiences of the DARWIN collaboration, and how it can serve as a sustainable blueprint for other collaborations.
BaBar stopped data taking in 2008 but its data is still analyzed by the collaboration. In 2021 a new computing system outside of the SLAC National Accelerator Laboratory was developed; major changes were needed to preserve the collaboration's ability to analyze the data, while the user-facing front ends all needed to stay the same. The new computing system was put in production in 2022 and we will describe its unique infrastructure, based on cloud compute in Victoria, Canada, data storage at GridKa, Germany, streaming data access, as well as the possibility to analyze any data from anywhere. We will show the advantages of the current system, how to run an old and outdated OS on current infrastructure, the complications we faced when developing the system, and our experience in running and using it for about two years. It may be of interest to other groups and experiments when planning for data preservation with the ability to continue to analyze the data, even decades after data taking has stopped.
Wireless IoT devices are omnipresent in our homes and workplaces, but their use in particle accelerators is still uncommon. Although the advantages of movable sensors communicating over wireless networks are obvious, the harsh radiation environment of a particle accelerator has been an obstacle to the use of such sensitive devices. Recently, though, CERN has developed a radiation-hard LoRaWAN-based platform that can be adapted to support multiple sensors.
We report here on this platform, the deployment of an LPWAN network based on LoRaWAN technology in the underground areas at CERN, the infrastructure and tools developed to support device integration and data collection, and, finally, on some of the positive benefits that have been delivered through the use of these sensors in CERN’s accelerator complex.
Modern data centres provide the efficient Information Technology (IT) infrastructure needed to deliver resources,
services, monitoring systems and collected data in a timely fashion. At the same time, data centres have been continuously
evolving, foreseeing large increases of resources and adapting to cover multifaceted niches.
The CNAF group at INFN (National Institute for Nuclear Physics) has implemented a Big Data Platform (BDP)
infrastructure, designed for the collection and the indexing of log reports from CNAF facilities.
The infrastructure is an ongoing project at CNAF and is at the service of the Italian groups working in high-energy physics
experiments. Within this framework, the first data pipeline was established for the ATLAS experiment, using input from the
ATLAS Distributed Computing system PanDA.
This pipeline focuses on the ATLAS computational job data processed by the Italian INFN Tier-1 computing farm. The system
has been operational and effective for several years, marking our initiative as the first to integrate job information
directly with the infrastructure. Following the finalization of data transmission, our objective is to analyse
and monitor the PanDA job data. This will involve examining the performance metrics of the machines and identifying
the log errors that lead to job failures.
DESY operates multiple dCache storage instances for multiple communities. As each community has different workflows and workloads, their dCache installations range from very large instances with more than 100 PB of data, to instances with up to billions of files or instances with significant LAN and WAN I/O.
To successfully operate all instances and quickly identify issues and performance bottlenecks, DESY IT relies heavily on dCache's own storage events for monitoring. Each atomic operation in the distributed storage instances triggers a storage event with details of the corresponding transfer or service status change.
These events are collected and parsed through an Apache Kafka event streaming bus. From the Kafka event stream, the events are aggregated in an Elasticsearch/Lucene-based database and search engine for on-the-fly operational diagnostics and analytics. Beyond day-to-day operations, an on-demand Apache Spark cluster on top of the National Analysis Facility at DESY is used for detailed analyses of operational data, extracting information over a wide time span and large numbers of storage events. In a similar fashion, all dCache log messages are also processed through a Kafka stream, allowing passive monitoring that raises an alarm when a specific signature appears. ML and AI algorithms for predictive maintenance are in the development pipeline. Furthermore, additional metrics are collected from the dCache pools themselves and also pushed to Kafka to generate an almost complete picture of the dCache instances.
In this talk, we present our aggregation and analysis pipelines and workflows and how they enable DESY IT to scale out dCache storage for heterogeneous user groups and use cases.
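Consuming such storage events can be sketched with the kafka-python client as below; the broker address, topic name and event fields are hypothetical placeholders rather than the actual DESY configuration.

```python
# Minimal consumer sketch for dCache-style storage events on Kafka
# (kafka-python). Broker, topic and field names are hypothetical.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "dcache.billing",                              # hypothetical topic
    bootstrap_servers=["kafka01.example.org:9092"],
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    group_id="ops-diagnostics",
)

for record in consumer:
    event = record.value
    # e.g. flag slow transfers before forwarding to Elasticsearch
    if event.get("transferTime", 0) > 60_000:      # ms, hypothetical field
        print("slow transfer:", event.get("pnfsid"), event.get("transferTime"))
```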
Queen Mary University of London (QMUL) has recently finished refurbishing the data centre that houses our computing cluster supporting the WLCG project. After 20 years of operation the original data centre had significant cooling issues; this, together with rising energy prices and growing awareness of climate change, drove the need for refurbishment. In addition, there is a need to increase the capacity (from 150 kW) to cope with the expected increased needs of the high-luminosity LHC and new astronomy projects such as the LSST and SKA observatories.
A summary of the project is presented, covering the project timeline and the solutions implemented (in-row cooling, hot-aisle containment, heat pumps and dry air coolers). Experiences and lessons learnt in the design, building and use of the data centre (covering choices in power supply, rack density, storage space, floor type, lighting, monitoring, etc.) are discussed. The effects of budget constraints and of project rescoping due to inflation are also discussed.
First data from the energy use and heat recovery are presented and estimates of the energy and carbon saving over time are given.
Collaboration, Reinterpretation, Outreach and Education
The Science and Technology Facilities Council (STFC), part of UK Research and Innovation (UKRI), has a rich tradition of fostering public engagement and outreach, as part of its strategic aim to showcase and celebrate STFC science, technology, and staff, both within its National Laboratories and throughout the broader community.
As part of its wider programme, STFC organised two large scale public engagement open weeks in 2023 and 2024. These events, held at the Sci-Tech Daresbury campus in the North of England and the Harwell Campus in the South of England, home to STFC's largest National Laboratories, collectively welcomed over 17,500 participants.
These open weeks provided an unparalleled opportunity for the public to intimately engage with groundbreaking science and technology. Attendees were immersed in hands-on activities, demonstrations, and enlightening talks spanning various disciplines and catering to all age groups. They also had the unique opportunity to explore the state-of-the-art facilities on site, and talk with the people who work here.
STFC's Scientific Computing Department (SCD) took the lead in orchestrating and delivering computing outreach initiatives during both open weeks. This paper details STFC's approach to organizing these open weeks, how they were structured, and SCD's planning process for contributing to the events, and then delves into the specifics of SCD's impactful contributions. By sharing these insights, this paper aims to offer valuable lessons for the effective execution of large-scale public engagement initiatives within the scientific community.
Since 1983 the Italian groups collaborating with Fermilab (US) have been running a two-month summer training program for Master's students. While in the first year the program involved only 4 physics students, in the following years it was extended to engineering students. Many students have continued their collaboration with Fermilab through their Master's theses and PhDs.
The program has involved almost 600 Italian students from more than 20 Italian universities. Each intern is supervised by a Fermilab mentor responsible for the training program. Training programs have spanned Tevatron, CMS, Muon g-2, Mu2e, SBN and DUNE design and data analysis, development of particle detectors, design of electronics and accelerator components, development of infrastructures and software for tera-data handling, quantum computing, and research on superconductive elements and accelerating cavities.
In 2015 the University of Pisa included the program within its own educational programs. Summer students are enrolled at the University of Pisa for the duration of the internship and, at its end, write summary reports on their achievements. After positive evaluation by a University of Pisa Examining Board, interns are awarded 6 ECTS credits for their Diploma Supplement. In 2020 and 2021 the program was cancelled due to the COVID-19 pandemic, but it was restarted in 2022, allowing a cohort of 21 students in 2022 and a cohort of 27 students in 2023 to be trained for nine weeks at Fermilab. We are now organizing the 2024 program.
The Remote^3 (Remote Cubed) project is an STFC Public Engagement Leadership Fellowship funded activity, organised in collaboration between the University of Edinburgh (UoE), and STFC’s Public Engagement Team, Scientific Computing Department, and Boulby Underground Laboratory – part of STFC Particle Physics.
Remote^3 works with school audiences to challenge teams of young people to design, build, and program their own LEGO Mindstorms “Mars Rover”, which will be tested at the Boulby Underground Laboratory’s Mars Yard, 1.1 km underground. Teams, with the assistance of mentors from UoE and STFC, will design their rover to complete various space-exploration themed challenges – ranging from taking a panoramic environment scan to navigating the Mars Yard landscape looking for LEGO brick samples. The project aims to engage with audiences who do not usually interact with STFC Public Engagement, such as more remote locations or areas of higher deprivation and give them the opportunity to work hands on with engineering and computing, whilst learning from and interacting with real scientists and engineers.
Since its inception in 2019, Remote^3 has flourished in a wide variety of environments and through multiple mediums: entirely virtual delivery during the lockdowns of 2020-21, sessions deep underground, in schools, storytelling at libraries, and in tents at festivals.
This year Remote^3 is building on the lessons learnt through this varied programme to deliver a series of engagement activities in conjunction with STFC’s Rutherford Appleton Laboratory Public Open Week, which has an expected audience of 20,000 people.
Virtual Visits have been an integral component of the ATLAS Education and Outreach programme since their inception in 2010. Over the years, collaboration members have hosted visits for tens of thousands of visitors located all over the globe. In 2024 alone, there have already been 59 visits through the month of May. Visitors in classrooms, at festivals and events, or even at home have a unique opportunity to engage with scientists located either underground in the ATLAS experimental cavern or in front of the control room, to learn about the goals and achievements of the collaboration. As part of the renovation of the ATLAS Visitor Centre at LHC Point 1, a new installation was constructed to facilitate Virtual Visits during the running of the LHC. We present the overall programme and the new installation, and discuss recent initiatives to expand our reach, including Open Visits on Zoom, Facebook, YouTube and TikTok Live.
If a physicist needs to ask for help on some software, where should they go? For a specific software package, there may be a preferred website, such as the ROOT Forum or a GitHub/GitLab Issues page, but how would they find this out? What about problems that cross package boundaries? What if they haven't found a tool that would solve their problem yet?
HEP-Help (hep-help.org) is intended as a first-stop helpline for questions about particle physics software. It is not intended to replace established venues, but redirect users to the best place to ask their questions, and possibly help them frame their questions in better ways, such as distinguishing usage questions from bug reports and constructing minimal reproducers.
This project has two parts: one technological and one social. The technical aspect involves collating existing documentation, tutorials, and forum archives to produce a dataset to train an LLM as a first responder. The social aspect involves building a community of part-time responders, people who take shifts (help-a-thons!) to correct or follow up on the LLM's initial suggestions. This community includes tutorial trainers, developers of particle physics software, and experienced users, all of whom are already invested in helping new users and stand to benefit from a more organized support system.
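To make the technological part more concrete, the sketch below shows one possible shape of such a "first responder": a simple retrieval step that matches an incoming question against a corpus of documentation and forum snippets and suggests the venue attached to the best matches. It is an illustrative sketch only; the corpus entries, venue names and the use of TF-IDF are assumptions made for this example, not the actual HEP-Help dataset or model.

```python
# Illustrative sketch only: match a question against documentation/forum
# snippets and suggest the venue attached to the best matches. The corpus and
# venue names below are invented placeholders, not the HEP-Help training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    ("How to draw a TH1 histogram with a log-scale y axis", "ROOT Forum"),
    ("uproot.open raises FileNotFoundError for an xrootd URL", "Scikit-HEP / uproot GitHub issues"),
    ("Submitting multi-core jobs to HTCondor", "Local batch-system documentation"),
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(text for text, _ in corpus)

def suggest_venues(question: str, top_k: int = 2):
    """Return the most similar snippets and their suggested venues."""
    scores = cosine_similarity(vectorizer.transform([question]), doc_matrix)[0]
    ranked = sorted(zip(scores, corpus), key=lambda item: -item[0])[:top_k]
    return [(venue, text, round(float(score), 3)) for score, (text, venue) in ranked]

if __name__ == "__main__":
    for venue, text, score in suggest_venues("error opening a root file with uproot over xrootd"):
        print(f"{score:5.3f}  {venue:40s}  (matched: {text})")
```

In a fuller system, the retrieved snippets would be handed to the LLM as context, and the human shifters would then review or correct its suggested reply.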
Large Language Models (LLMs) have emerged as a transformative tool in society and are steadily working their way into scientific workflows. Despite their known tendency to hallucinate, which renders them perhaps unsuitable for direct scientific pipelines, LLMs excel in text-related tasks, offering a unique solution to manage the overwhelming volume of information presented at large conferences such as ACAT, ICHEP, and CHEP. This poster presents an innovative open-source application that harnesses the capabilities of an LLM to rank conference abstracts based on a user's specified interests. Given a list of the user's interests, the LLM can sift through a multitude of abstracts, identifying those most relevant to the user and effectively helping to tailor the conference experience. The LLM, in this context, serves an assistant role, aiding conference attendees in navigating the deluge of information typical of large conferences. The poster will detail the workings of this application, provide prompts to optimize its use, and discuss potential future directions for this type of application.
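As a minimal sketch of the ranking idea, the snippet below asks a chat-style LLM to order abstracts by relevance to a list of interests. The choice of the OpenAI client, the model name and the prompt wording are assumptions made for illustration; the actual application, its prompts and its backend are not reproduced here.

```python
# Illustrative sketch: prompt an LLM to rank abstracts against user interests.
# Client, model name and prompt are assumptions, not the application's own.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def rank_abstracts(interests: list[str], abstracts: dict[str, str]) -> str:
    """Return the model's ranking of abstract IDs, most relevant first."""
    listing = "\n\n".join(f"[{aid}] {text}" for aid, text in abstracts.items())
    prompt = (
        "My interests: " + ", ".join(interests) + "\n\n"
        "Rank the following conference abstracts from most to least relevant "
        "to my interests. Answer with a list of abstract IDs and a one-line "
        "justification for each.\n\n" + listing
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # keep the ranking as deterministic as possible
    )
    return response.choices[0].message.content

# Example usage with two toy abstracts:
# print(rank_abstracts(["tape storage", "data carousel"],
#                      {"A1": "A study of GNN tracking ...",
#                       "A2": "Operational experience with tape archives ..."}))
```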
Place: AGH University main building A0, Mickiewicza 30 Av., Krakow
Quantum computers have reached a stage, often referred to as the Quantum Utility Era, where they can perform complex calculations on around 100 qubits.
They are being utilized in fields such as materials science, condensed matter, and particle physics to explore problems beyond the capabilities of classical computers. In this talk, we will highlight the progress in both IBM quantum hardware and software that not only opens opportunities for large-scale applications utilizing error mitigation, but also paves the way toward error-corrected systems within the next decade.
This year CERN celebrates its 70th Anniversary, and the 60th anniversary of Bell's theorem, a result that arguably had the single strongest impact on modern foundations of quantum physics, both at the conceptual and methodological level, as well as at the level of its applications in information theory and technology.
CERN has started the second phase of its Quantum Technology Initiative with a 5-year plan aligned with the CERN research and collaboration objectives. This effort is designed to build specific capacity and technology platforms and to support a longer-term strategy for using quantum technology at CERN and in HEP in the future. After a preliminary introduction about the promise of quantum computing, we will discuss the main research directions and results, from the theoretical foundations of quantum machine learning algorithms to applications in several areas of HEP.
Michele Grossi, PhD https://michele-grossi.web.cern.ch
As CERN approaches the launch of the High-Luminosity Large Hadron Collider (HL-LHC) by the decade's end, the computational demands of traditional simulations have become untenably high. Projections show millions of CPU-years required to create simulated datasets, with a substantial fraction of CPU time devoted to calorimetric simulations. This presents unique opportunities for breakthroughs in computational physics. We show how Quantum-assisted Generative AI can be used to create synthetic, realistically scaled calorimetry datasets. The model is constructed by combining D-Wave's Quantum Annealer processor with a Deep Learning architecture, improving timing performance with respect to first-principles simulations and Deep Learning models alone, while maintaining current state-of-the-art data quality.
Meeting point: in front of the venue, the Auditorium Maximum of Jagiellonian University - Krupnicza 33 Street.
Recent Large Language Models like ChatGPT show impressive capabilities, e.g. in the automated generation of text and computer code. These new techniques will have long-term consequences, including for scientific research in fundamental physics. In this talk I present the highlights of the first Large Language Model Symposium (LIPS) which took place in Hamburg earlier this year. I will focus on high energy physics and will also give an outlook towards future developments and applications.
A diverse panel will discuss the potential impact of progress in the fields of Quantum Computing and the latest generation of Machine Learning, such as LLMs. The panel brings together experts in QC, LLMs, ML in HEP, Theoretical Physics, and large-scale computing in HEP. The discussion will be moderated by Liz Sexton-Kennedy from the Fermi National Accelerator Laboratory.
Data and Metadata Organization, Management and Access
The CERN Tape Archive (CTA) scheduling system implements the workflow and lifecycle of Archive, Retrieve and Repack requests. The transient metadata for queued requests is stored in the Scheduler backend store (Scheduler DB). In our previous work, we presented the CTA Scheduler together with an objectstore-based implementation of the Scheduler DB. Now with four years of experience in production, the strengths and limitations of this implementation are better understood. While the objectstore-based implementation is highly efficient for FIFO queueing operations (archive/retrieve), non-FIFO operations (delete, priority queues) require some workarounds. The objectstore backend implementation imposes constraints on how the CTA Scheduler code can be modified and is an additional software dependency and technology for developers to learn. This paper discusses an alternate Scheduler DB implementation, based on relational database technology. We include a status report and roadmap.
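To illustrate why a relational backend eases the non-FIFO operations mentioned above, the sketch below implements a minimal request queue in SQLite. It is a toy written for this purpose only, not the actual CTA Scheduler DB schema; table and column names are invented.

```python
# Minimal sketch (not the actual CTA schema): a relational request queue in
# SQLite, where non-FIFO operations (delete, priorities) are plain SQL.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE retrieve_request (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        vid       TEXT NOT NULL,          -- tape volume identifier
        file_id   TEXT NOT NULL,
        priority  INTEGER NOT NULL DEFAULT 0,
        queued_at REAL NOT NULL DEFAULT (julianday('now'))
    )
""")

def enqueue(vid, file_id, priority=0):
    db.execute("INSERT INTO retrieve_request (vid, file_id, priority) VALUES (?, ?, ?)",
               (vid, file_id, priority))

def pop_next(vid):
    """FIFO within a tape, but higher priority first: one indexed query."""
    row = db.execute(
        "SELECT id, file_id FROM retrieve_request WHERE vid = ? "
        "ORDER BY priority DESC, queued_at ASC LIMIT 1", (vid,)).fetchone()
    if row:
        db.execute("DELETE FROM retrieve_request WHERE id = ?", (row[0],))
    return row

def cancel(file_id):
    """A non-FIFO operation that is awkward in a pure object-store queue."""
    db.execute("DELETE FROM retrieve_request WHERE file_id = ?", (file_id,))

enqueue("L90001", "file_a"); enqueue("L90001", "file_b", priority=5)
cancel("file_a")
print(pop_next("L90001"))   # -> (2, 'file_b')
```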
The latest tape hardware technologies (LTO-9, IBM TS1170) impose new constraints on the management of data archived to tape. In the past, new drives could read the previous one or even two generations of media, but this is no longer the case. This means that repacking older media to new media must be carried out on a more aggressive schedule than in the past. An additional challenge is the large capacity of the newer media. A 50 TB tape can contain a vast number of files, whose metadata must be tracked during repacking. Repacking an entire tape also requires a significant amount of disk storage. At CERN Tier-0, these challenges have created new operational problems to solve, in particular contention for resources between physics archival and repack operations. This contribution details these problems and describes the various approaches we have taken to mitigate and solve them. We include a roadmap for future repack developments.
Storing the ever-increasing amount of data generated by the LHC experiments is still inconceivable without making use of cost-effective, though inherently complex, tape technology. The GridKa tape storage system used to rely on IBM Spectrum Protect (SP). Due to a variety of limitations, and to meet the even higher requirements of the HL-LHC project, GridKa decided to switch from SP to the High Performance Storage System (HPSS).
Even though HPSS is highly scalable and performant tape management software, it required special adjustments to fulfill all GridKa requirements. Based on the experience gained with the former tape system, the implementation team developed specific stress scenarios. Running these tests and interpreting their results allowed a successful adaptation of HPSS and made it the core component of the GridKa tape storage system.
To increase performance, the architecture of the system was reshaped and the stored data was colocated in a more appropriate, tape-oriented way to match the requirements of each experiment and the demands of HPSS. In total, 70 PB of data and 40 million files were migrated from the legacy system to the new tape system at GridKa.
This contribution presents the internal architecture of the new tape storage system, the implementation and migration process, the encountered issues, the achieved results and ongoing work on open items.
The High Luminosity upgrade to the LHC (HL-LHC) is expected to generate scientific data on the scale of multiple exabytes. To tackle this unprecedented data storage challenge, the ATLAS experiment initiated the Data Carousel project in 2018. Data Carousel is a tape-driven workflow in which bulk production campaigns with input data resident on tape are executed by staging and promptly processing a sliding window of files on a disk buffer, such that only a small fraction of the input files are pinned on disk at any one time. Put into ATLAS production before Run 3, Data Carousel continues to be our focus for seeking new opportunities in disk space savings and enhancing tape usage throughout the ATLAS Distributed Computing (ADC) environment. These efforts are highlighted by two recent ATLAS HL-LHC demonstrator projects: data-on-demand and tape smart writing. We will discuss the recent studies and outcomes from these projects, along with various related improvements across the ATLAS distributed computing software. The research was conducted together with site experts at CERN and the Tier-1 centers.
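The sliding-window idea at the heart of such a tape-driven workflow can be sketched in a few lines: at most a fixed number of input files are staged on the disk buffer at any one time, and each processed file is released before the next stage-in is requested. The function names below are placeholders for illustration, not ATLAS or Rucio APIs.

```python
# Toy sketch of the sliding-window ("data carousel") idea: at most `window`
# files are pinned on the disk buffer at once. stage_in/process/release are
# placeholders standing in for the real data-management and production calls.
from collections import deque

def carousel(files, window, stage_in, process, release):
    staged = deque()
    files = iter(files)

    # Fill the initial window of staged files.
    for f in files:
        stage_in(f)
        staged.append(f)
        if len(staged) == window:
            break

    # Process one file, release its disk copy, and top the window back up.
    while staged:
        current = staged.popleft()
        process(current)
        release(current)
        nxt = next(files, None)
        if nxt is not None:
            stage_in(nxt)
            staged.append(nxt)

# Example with print-outs standing in for real stage/process/release calls:
carousel([f"file_{i}" for i in range(6)], window=2,
         stage_in=lambda f: print("stage  ", f),
         process=lambda f: print("process", f),
         release=lambda f: print("release", f))
```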
The Vera Rubin Observatory is a very ambitious project. Using the world’s largest ground-based telescope, it will take two panoramic sweeps of the visible sky every three nights using a 3.2 Giga-pixel camera. The observation products will generate 15 PB of new data each year for 10 years. Accounting for reprocessing and related data products the total amount of critical data will reach several hundred PB. Because the camera consists of 201 CCD panels, the majority of the data products will consist of relatively small files in the low megabyte range, impacting data transfer performance. Yet, all of this data needs to be backed up in offline storage and still be easily retrievable not only for groups of files but also for individual files. This paper describes how SLAC is building a Rucio-centric specialized Tape Remote Storage Element (TRSE) that automatically creates a copy of a Rucio dataset as a single indexed file avoiding transferring many small files. This not only allows high-speed transfer of the data to tape for backup and dataset restoral, but also simple retrieval of individual dataset members in order to restore lost files. We describe the design and implementation of the TRSE and how it relates to current data management practices. We also present performance characteristics that make backups of extremely large scale data collections practical.
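The core idea of writing a dataset as a single indexed object, while keeping individual members retrievable, can be illustrated with a toy packer that concatenates small files and records their byte offsets. This is purely an illustration of the concept; the actual TRSE format and its Rucio integration are not reproduced here.

```python
# Toy illustration (not the actual TRSE format): pack many small files into one
# archive with a JSON byte-offset index, so the dataset can go to tape as a
# single object while individual members remain retrievable by seeking.
import json
import os

def pack(member_paths, archive_path):
    index = {}
    with open(archive_path, "wb") as out:
        for path in member_paths:
            with open(path, "rb") as member:
                data = member.read()
            index[os.path.basename(path)] = (out.tell(), len(data))
            out.write(data)
    with open(archive_path + ".idx", "w") as idx:
        json.dump(index, idx)

def read_member(archive_path, name):
    """Restore a single lost file without unpacking the whole archive."""
    with open(archive_path + ".idx") as idx:
        offset, size = json.load(idx)[name]
    with open(archive_path, "rb") as arch:
        arch.seek(offset)
        return arch.read(size)
```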
Due to the increasing volume of physics data being produced, the LHC experiments are making more active use of archival storage. Constraints on available disk storage have motivated the evolution towards the "data carousel" and similar models. Datasets on tape are recalled multiple times for reprocessing and analysis, and this trend is expected to accelerate during the Hi-Lumi era (LHC Run-4 and beyond).
Currently, storage endpoints are optimised for efficient archival, but it is becoming increasingly important to optimise for efficient retrieval. This problem has two dimensions. To reduce unnecessary tape mounts, the spread of each dataset - the number of tapes containing files which will be recalled at the same time - should be minimised. To reduce seek times, files from the same dataset should be physically colocated on the tape. The Archive Metadata specification is an agreed format for experiments to provide scheduling and colocation hints to storage endpoints to achieve these goals.
This contribution describes the motivation, the review process with the various stakeholders and the constraints that led to the Archive Metadata proposal. We present the implementation and deployment in the CERN Tape Archive and our preliminary experiences of consuming Archive Metadata at WLCG Tier-0.
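For readers unfamiliar with the concept, the snippet below shows, with invented field names, the kind of colocation and scheduling hints an experiment could attach to an archive request. It is purely illustrative and is not the agreed Archive Metadata specification.

```python
# Illustrative only: example scheduling/colocation hints for an archive
# request. Field names and values are invented for this sketch and are NOT
# the agreed Archive Metadata specification.
import json

archive_hints = {
    "collocation": {
        "dataset": "EXPERIMENT.periodX.rawdata",  # keep members of a dataset together on tape
        "group": "physics-raw",
    },
    "scheduling": {
        "priority": "bulk",          # archival urgency hint
        "expected_recall": "high",   # dataset likely to be recalled as a unit
    },
}

# The hints would travel with the transfer request, e.g. as a JSON blob:
print(json.dumps(archive_hints, indent=2))
```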
Online and real-time computing
Ensuring the quality of data in large HEP experiments such as CMS at the LHC is crucial for producing reliable physics outcomes. The CMS protocols for Data Quality Monitoring (DQM) are based on the analysis of a standardized set of histograms offering a condensed snapshot of the detector's condition. Besides the required personpower, the method has a limited time granularity, potentially hiding temporary anomalies. Unsupervised machine learning models such as autoencoders and convolutional neural networks have recently been deployed for anomaly detection with per-lumisection granularity. Nevertheless, given the diversity of detector technologies, geometries and physics signals characterizing each subdetector, different tools are developed in parallel and maintained by the subdetector experts. In this contribution, we discuss the development of an automated DQM for the online monitoring of the CMS Muon system, offering a flexible tool for the different muon subsystems based on deep learning models trained on occupancy maps. The potential flexibility and extensibility to different detectors, as well as the effort towards the integration of per-lumisection monitoring in the DQM workflow, will be discussed.
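A minimal sketch of the kind of model involved is shown below: a small convolutional autoencoder whose per-lumisection reconstruction error serves as an anomaly score on occupancy maps. The architecture, the 32x64 map binning and the thresholding are illustrative assumptions, not the CMS production models.

```python
# Minimal sketch (PyTorch), assuming occupancy maps binned as 1x32x64 images:
# a convolutional autoencoder whose reconstruction error flags anomalous
# lumisections. Architecture and threshold are illustrative assumptions.
import torch
import torch.nn as nn

class OccupancyAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),   # 32x64 -> 16x32
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),  # 16x32 -> 8x16
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(16, 8, 2, stride=2), nn.ReLU(),    # 8x16 -> 16x32
            nn.ConvTranspose2d(8, 1, 2, stride=2), nn.Sigmoid(),  # 16x32 -> 32x64
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_scores(model, maps):
    """Mean squared reconstruction error per lumisection (batch dimension)."""
    with torch.no_grad():
        recon = model(maps)
    return ((recon - maps) ** 2).mean(dim=(1, 2, 3))

model = OccupancyAE()
lumisections = torch.rand(10, 1, 32, 64)       # 10 normalised occupancy maps
scores = anomaly_scores(model, lumisections)
flagged = (scores > scores.mean() + 3 * scores.std()).nonzero().flatten()
print(scores, flagged)
```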
Hydra is an advanced framework designed for training and managing AI models for near real time data quality monitoring at Jefferson Lab. Deployed in all four experimental halls, Hydra has analyzed over 2 million images and has extended its capabilities to offline monitoring and validation. Hydra utilizes computer vision to continually analyze sets of images of monitoring plots generated 24/7 during experiments. Generally, these sets of images are produced at a rate and quantity that is exceedingly difficult for shift crews to effectively monitor. Significant effort has been devoted to enhancing Hydra's user interface, to ensure that it provides clear, actionable insights for shift workers and other users. Gradient Weighted Class Activation Maps (GradCAM) provide added interpretability, allowing users to visualize important regions of the image for classification. Hydra has been containerized to enable the creation of portable demos and seamless integration with container-based technologies such as Kubernetes and Docker. With the user interface enhancements and containerization, Hydra can be rapidly deployed for new use cases and experiments. This talk will describe the Hydra framework, its user interface and experience, and the challenges inherent in its design and deployment.
The first level of the trigger system of the LHCb experiment (HLT1) reconstructs and selects events in real time at the LHC bunch-crossing rate, in software running on GPUs. It must carefully balance a broad physics programme that extends from kaon physics up to the electroweak scale. An automated procedure to determine the selection criteria is adopted that maximises the physics output of the entirety of this programme while satisfying constraints from the higher-level components of the trigger system, which cap the output rate of HLT1 at around 1 MHz. In this talk, the method by which this optimisation is achieved will be described in detail; it uses a variant of the ADAM algorithm popular in machine learning tools, customised to solve discrete minimisation problems. The impact of this optimisation on the first data taken by the LHCb experiment in its nominal Run 3 configuration will also be shown.
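To give a flavour of what an Adam-style update constrained to discrete parameters looks like, a generic sketch is shown below: selection thresholds are snapped to an allowed grid after every step. This is a minimal illustration under assumed settings, not the LHCb customisation, and the toy gradient stands in for a differentiable surrogate of the rate/efficiency objective.

```python
# Generic sketch, not the LHCb implementation: an Adam-style update where the
# optimised parameters (e.g. selection thresholds) are snapped to a discrete
# grid after every step. The toy quadratic gradient below stands in for a
# surrogate of the physics/rate figure of merit.
import numpy as np

def adam_on_grid(grad_fn, x0, grid_step, lr=0.1, beta1=0.9, beta2=0.999,
                 eps=1e-8, n_steps=200):
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, n_steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
        x = np.round(x / grid_step) * grid_step   # snap to the allowed grid
    return x

# Toy objective: quadratic penalty pulling two thresholds towards (1.3, -0.7).
grad = lambda x: 2 * (x - np.array([1.3, -0.7]))
print(adam_on_grid(grad, x0=[0.0, 0.0], grid_step=0.1))
# prints thresholds on the 0.1 grid, close to the unconstrained optimum
```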
The architecture of the existing ALICE Run 3 online real-time visualization solution was designed for easy modification of the visualization method used. In addition to the existing visualization based on a desktop application, a version using browser-based visualization has been prepared. In this case, the visualization is computed and displayed directly in the browser on the user's computer, with no need to install any software. The overall visualization architecture allows for a smooth switch to the new version of the visualization: for a transition period both solutions (traditional desktop and web) can be used simultaneously.
ALICE visualization requires loading information about the displayed tracks (of which there may be several tens of thousands). This type of visualization differs from visualizations typically used in computer graphics, where high efficiency of motion representation is achieved by modifying the transformations describing the motion of already loaded models. In event visualization, the description of the tracks (models) changes with each view. Achieving high display performance therefore requires a number of optimization techniques.
The data downloaded by the web application is already pre-processed and prepared to be loaded to the graphics card, thanks to which the calculations in the browser are significantly simplified and the performance of the browser visualization is comparable to the visualization in the desktop application.
When creating the new visualization, a component-based approach to building the web application was used: individual components are responsible for different functions (e.g. data retrieval, different visualizations, interaction with the user). This building-block construction allows for easy rearrangement by replacing or adding components. The testing process is also significantly simplified because each component can be tested independently.
The LHCb experiment at CERN has undergone a comprehensive upgrade. In particular, its trigger system has been completely redesigned into a hybrid-architecture, software-only system that delivers ten times more interesting signals per unit time than its predecessor. This increased efficiency - as well as the growing diversity of signals physicists want to analyse - makes conforming to crucial operational targets on bandwidth and storage capacity ever more challenging. To address this, a comprehensive, automated testing framework has been developed that emulates the entire LHCb trigger and offline-processing software stack on simulated and real collision data. Scheduled both nightly and on-demand by software testers during development, these tests measure the online- and offline-processing's key operational performance metrics (such as rate and bandwidth), for each of the system's 3500 distinct physics selection algorithms, and their cumulative totals. The results are automatically delivered via concise summaries - to GitLab merge requests and instant messaging channels - that further link to an extensive dashboard of per-algorithm information. The dashboard and pages therein (categorised by physics working group) facilitate exploratory data analysis and test-driven trigger development by 100s of physicists, whilst the concise summaries enable efficient, data-driven decision-making by management and software maintainers. Altogether, this novel and performant bandwidth testing framework has been helping LHCb build an operationally-viable trigger and data-processing system whilst maintaining the efficiency to satisfy its physics goals.
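The two operational metrics at the centre of such tests can be computed straightforwardly from trigger decisions and event sizes, as in the sketch below. The line names, numbers and input structures are invented for illustration; the actual framework's data model is not reproduced here.

```python
# Illustrative sketch of the two key per-line metrics such a test reports:
# rate (accepted events per second of collected data) and bandwidth
# (accepted bytes per second). Line names and numbers are invented.
def line_metrics(decisions, event_sizes_kB, sample_seconds):
    """decisions: {line_name: [bool per event]}, event_sizes_kB: [kB per event]."""
    metrics = {}
    for line, accepted in decisions.items():
        n_acc = sum(accepted)
        kb_acc = sum(size for size, keep in zip(event_sizes_kB, accepted) if keep)
        metrics[line] = {
            "rate_kHz": n_acc / sample_seconds / 1e3,
            "bandwidth_GB_s": kb_acc / sample_seconds / 1e6,
        }
    return metrics

decisions = {"Hlt2_LineA": [True, False, True], "Hlt2_LineB": [False, False, True]}
sample_seconds = 1e-7  # time represented by 3 events at a 30 MHz input rate
print(line_metrics(decisions, event_sizes_kB=[120.0, 80.0, 150.0],
                   sample_seconds=sample_seconds))
```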
The Mu2e experiment at Fermilab aims to observe coherent neutrinoless conversion of a muon to an electron in the field of an aluminum nucleus, with a sensitivity improvement of 10,000 times over current limits.
The Mu2e Trigger and Data Acquisition System (TDAQ) uses the \emph{otsdaq} framework as its online Data Acquisition System (DAQ) solution.
Developed at Fermilab, \emph{otsdaq} integrates several framework components, namely an \emph{artdaq}-based DAQ, \emph{art}-based event processing, and an EPICS-based detector control system (DCS), and provides a uniform multi-user interface to these components through a web browser.
The Mu2e tracker and calorimeter data streams are processed by a one-level software trigger implemented within the \emph{art} framework.
Events accepted by the trigger have their data combined, post-trigger, with the separately read-out data from the Mu2e Cosmic Ray Veto system.
The Mu2e DCS is built on EPICS (Experimental Physics and Industrial Control System), an open-source platform for monitoring, controlling, alarming, and archiving.
A prototype of the TDAQ and DCS systems has been built and tested over the last three years at Fermilab's Feynman Computing Center. Now, the production system installation is underway.
This report covers the project's development progress, focusing in particular on the web-based user interface and the slow-control implementation.
Offline Computing
The imminent high-luminosity era of the LHC will pose unprecedented challenges to the CMS detector. To meet these challenges, the CMS detector will undergo several upgrades, including replacing the current endcap calorimeters with a novel High-Granularity Calorimeter (HGCAL). A dedicated reconstruction framework, The Iterative Clustering (TICL), is being developed within the CMS Software (CMSSW). This new framework is designed to fully exploit the high spatial resolution and precise timing information provided by HGCAL, as well as the information from other subdetectors (e.g., the Tracker and the MIP Timing Detector, MTD). Its reconstruction capabilities aim to provide the final global event interpretation while mitigating the effects of the dense pile-up environment. The TICL framework, crafted with heterogeneous computing in mind, is a unique solution to the computing challenges of the HL-LHC phase. Data structures and algorithms have been developed for massively parallel architectures using the Alpaka performance portability library. The framework reconstructs particle candidates starting from the hundreds of thousands of energy deposits left in the calorimeter. Dedicated clustering algorithms have been developed to retain the physics information while reducing the problem complexity by orders of magnitude. Pattern recognition algorithms aim to reconstruct particle showers in 3-dimensional space, striving for high efficiency and cluster purity while keeping the pile-up contamination as low as possible. The high-purity requirements, together with detector inhomogeneity, lead to fragmented 3D clusters. An additional linking step is available to recover from this fragmentation. In this step, several algorithms are adopted to target different types of particle shower reconstruction. A SuperClustering linking plugin has been developed for electron and photon reconstruction, while geometrical linking is used to target hadron reconstruction. The final charged candidates are built by linking tracks with the HGCAL 3D clusters, exploiting timing information from both HGCAL and MTD. This presentation will introduce the TICL framework. Its physics and computational performance will be highlighted, showcasing the approach adopted to face the challenges of the HL-LHC.
In response to the increased data complexity anticipated with the upcoming upgrade of the Large Hadron Collider (LHC), the Compact Muon Solenoid (CMS) experiment is developing an advanced endcap High-Granularity Calorimeter (HGCAL) capable of enduring the more demanding conditions of the High-Luminosity LHC, with about 200 overlapping proton-proton collisions in a single bunch crossing resulting in several hundred thousand hits in each endcap. During the particle shower reconstruction phase in HGCAL, 3D graph structures called tracksters are generated. These tracksters connect energy deposits across each layer of the detector, representing clusters of energy believed to originate from the same physics object. However, the inhomogeneous geometry of the detector, coupled with the lumpy nature of hadronic showers, particle overlaps, and preceding algorithm cuts tuned for high purity, often leads to full particle showers being fragmented into multiple tracksters. This effect compromises the quality of the reconstruction and must be addressed through an additional trackster-linking step. This study delves into a machine learning approach leveraging Graph Neural Network (GNN) models with attention mechanisms to enhance the multi-purpose tasks of calorimetric event reconstruction, including trackster linking, energy regression, and particle identification. In this work, we show the result of applying the proposed model to hadronic shower data in the dense environment of HGCAL, with the network fully integrated into the CMS Software.
We present an ML-based end-to-end algorithm for adaptive reconstruction in different FCC detectors. The algorithm takes detector hits from different subdetectors as input and reconstructs higher-level objects. For this, it exploits a geometric graph neural network, trained with object condensation, a graph segmentation technique. We apply this approach to study the performance of pattern recognition in the IDEA detector using hits from the pixel vertex detector and the drift chamber. We also build particle candidates from detector hits and tracks in the CLD detector. Our algorithm outperforms current baselines in efficiency and energy reconstruction and allows pattern recognition in the IDEA detector. This approach is easily adaptable to new geometries and therefore opens the door to reconstruction performance-aware detector optimization.
We present an end-to-end reconstruction algorithm for highly granular calorimeters that includes track information to aid the reconstruction of charged particles. The algorithm starts from calorimeter hits and reconstructed tracks, and outputs a coordinate transformation in which all shower objects are well separated from each other, and in which clustering becomes trivial. Shower properties such as particle ID and energy are predicted from representative points within showers. This is achieved using an extended version of the object condensation loss, a graph segmentation technique that allows the clustering of a variable number of showers in every event while simultaneously performing regression and classification tasks. The backbone is an architecture based on a newly-developed translation-equivariant version of GravNet layers. These dynamically build learnable graphs from input data to exchange information along their edges. The model is trained on data from a simulated detector that matches the complexity of the CMS high-granularity calorimeter (HGCAL).
In recent years, high-energy physics discoveries have been driven by increases in detector volume and/or granularity. This evolution gives access to larger statistics and data samples, but can make it hard to process results with current methods and algorithms. Graph neural networks, particularly graph convolution networks, have been shown to be powerful tools to address these challenges. These methods, however, raise some difficulties with their computing resource needs. In particular, representing physics events as graphs is a tricky problem that demands a good balance between resource consumption and graph quality, which can greatly affect the accuracy of the model.
We propose a graph convolution network pipeline architecture to perform classification and regression tasks on calorimeter events and discuss its performance. It is designed for resource-constrained environments, and in particular to efficiently represent calorimeter events as graphs, allowing up to a quadratic improvement in complexity with satisfactory accuracy. Finally, we discuss possible applications to other high energy physics detectors.
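A minimal example of a graph convolution network operating on calorimeter hits is sketched below, using PyTorch Geometric. Hits are nodes carrying features such as energy and position, and global pooling produces one classification score and one regressed energy per event. This is an illustrative sketch under those assumptions, not the pipeline proposed above.

```python
# Minimal sketch (PyTorch Geometric), not the pipeline described above: a small
# GCN over calorimeter hits producing a per-event class score and energy.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool
from torch_geometric.data import Data

class CaloGCN(nn.Module):
    def __init__(self, in_dim=4, hidden=32, n_classes=3):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.cls_head = nn.Linear(hidden, n_classes)   # e.g. particle species
        self.reg_head = nn.Linear(hidden, 1)           # e.g. deposited energy

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        pooled = global_mean_pool(h, batch)            # one vector per event
        return self.cls_head(pooled), self.reg_head(pooled)

# One toy event: 5 hits with (E, x, y, layer) features and a sparse edge list.
hits = torch.rand(5, 4)
edges = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]], dtype=torch.long)
event = Data(x=hits, edge_index=edges)
batch = torch.zeros(5, dtype=torch.long)               # all hits belong to event 0

logits, energy = CaloGCN()(event.x, event.edge_index, batch)
print(logits.shape, energy.shape)                      # -> (1, 3) and (1, 1)
```

The sparse edge list is what keeps the memory footprint modest: only the hit pairs actually connected in the graph are stored, which is the balance between graph quality and resource consumption discussed above.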
Particle identification (PID) is crucial in particle physics experiments. A promising breakthrough in PID involves cluster counting, which quantifies primary ionizations along a particle’s trajectory in a drift chamber (DC), rather than relying on traditional dE/dx measurements. However, a significant challenge in cluster counting lies in developing an efficient reconstruction algorithm to recover cluster signals from DC cell waveforms.
In PID, machine learning algorithms have emerged as the state-of-the-art. For simulated samples, an updated supervised model based on LSTM and DGCNN achieves a remarkable 10% improvement in separating K from $\pi$ compared to traditional methods. For test beam data samples collected at CERN, due to label scarcity and data/MC discrepancy, a semi-supervised domain adaptation model, which exploits Optimal Transport to transfer information between simulation and real data domains, is developed. The model is validated using pseudo data and further applied to real data. The performance is superior to the traditional methods and maintains consistent across varying track lengths.
Two related papers have been submitted to journals: arXiv:2402.16270 and arXiv:2402.16493. The former, on transfer learning, has been accepted by Computer Physics Communications (https://doi.org/10.1016/j.cpc.2024.109208).
Simulation and analysis tools
In this work we present the Graph-based Full Event Interpretation (GraFEI), a machine learning model based on graph neural networks to inclusively reconstruct events in the Belle II experiment.
Belle II is well suited to perform measurements of $B$ meson decays involving invisible particles (e.g. neutrinos) in the final state. The kinematical properties of such particles can be deduced from the energy-momentum imbalance obtained after reconstructing the companion $B$ meson produced in the event. This task is performed by reconstructing it either from all the particles in an event except the signal tracks, or using the Full Event Interpretation, an algorithm based on Boosted Decision Trees and limited to specific, hard-coded decay processes. A recent example involving the use of the aforementioned techniques is the search for the $B^+ \to K^+ \nu \bar \nu$ decay, which provided evidence for this process at about 3 standard deviations.
The GraFEI model is trained to predict the structure of the decay chain by exploiting the information from the detected final-state particles only, without making use of any prior assumptions about the underlying event. By retaining only signal-like decay topologies, the model considerably reduces the amount of background while keeping a relatively high signal efficiency. The performance of the model when applied to the search for $B^+ \to K^+ \nu \bar \nu$ is presented. The implementation of the model in the Belle II Analysis Software Framework is discussed.
In analyses conducted at Belle II, it is often beneficial to reconstruct the entire decay chain of both B mesons produced in an electron-positron collision event using the information gathered from the detectors. The currently used reconstruction algorithm, which starts from the final-state particles, consists of multiple stages that require manual configuration, and it suffers from low efficiency and a high number of wrongly reconstructed candidates.
Within this project, we are developing software with the goal of automatically reconstructing B decays at Belle II with both high efficiency and accuracy. The trained models should be capable of accommodating rare decays with very small branching ratios, or even decays unseen during the training phase.
To ensure optimal performance, the project is divided into three steps: particle embedding, particle reconstruction, and link prediction. Drawing inspiration from recent advancements in computer science, transformers and hyperbolic embeddings are employed as fundamental components, with metric learning serving as the primary training technique.
Subatomic particle track reconstruction (tracking) is a vital task in High-Energy Physics experiments. Tracking, in its current form, is exceptionally computationally challenging. Fielded solutions, relying on traditional algorithms, do not scale linearly and pose a major limitation for the HL-LHC era. Machine Learning (ML) assisted solutions are a promising answer.
Current ML model design practice is predominantly ad hoc. We aim for a methodology for automated search of model designs, consisting of complexity reduced descriptions of the main problem, forming a complexity spectrum. As the main pillar of such a method, we provide the REDuced VIrtual Detector (REDVID) as a complexity-aware detector model and particle collision event simulator. Through a multitude of configurable dimensions, REDVID is capable of simulations throughout the complexity spectrum. REDVID can also act as a simulation-in-the-loop, to both generate synthetic data efficiently and to simplify the challenge of ML model design evaluation. With REDVID, starting from the simplistic end of the complexity spectrum, lesser designs can be eliminated in a systematic fashion, early on. REDVID is not bound by real detector geometries and can be considered for simulations involving arbitrary detector designs.
As a simulation and a generative tool for ML-assisted solution design, REDVID is highly flexible, reusable and open-source. Reference data sets generated with REDVID are publicly available. Data generated using REDVID has enabled rapid development of multiple novel ML model designs, which is currently ongoing.
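To illustrate the complexity-spectrum idea in code, the toy below generates synthetic detector hits with a configurable number of layers, hit resolution and noise level, so data can be produced anywhere from a trivially simple to a busier regime. It is a stand-in written for this illustration, not REDVID itself.

```python
# Toy sketch in the spirit of a complexity-reduced virtual detector (a stand-in
# for illustration, not REDVID itself): layers, resolution and noise are
# configurable, spanning a simple complexity spectrum for synthetic events.
import numpy as np

def generate_event(n_tracks=3, n_layers=5, layer_spacing=1.0,
                   resolution=0.01, noise_hits_per_layer=0, rng=None):
    rng = rng or np.random.default_rng()
    hits = []  # rows of (layer_index, x, y, track_id); track_id -1 means noise
    for tid in range(n_tracks):
        slope = rng.uniform(-0.5, 0.5, size=2)         # straight-line toy track
        for layer in range(n_layers):
            z = layer * layer_spacing
            x, y = slope * z + rng.normal(0, resolution, size=2)
            hits.append((layer, x, y, tid))
    for layer in range(n_layers):
        for _ in range(noise_hits_per_layer):
            hits.append((layer, *rng.uniform(-1, 1, size=2), -1))
    return np.array(hits)

# Low-complexity end of the spectrum: few layers, no noise.
print(generate_event(n_tracks=2, n_layers=3).shape)
# Higher complexity: more layers and noise hits per layer.
print(generate_event(n_tracks=10, n_layers=12, noise_hits_per_layer=4).shape)
```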
Direct photons are unique probes to study and characterize the quark-gluon plasma (QGP), as they leave the collision medium mostly unscathed. Measurements at top Large Hadron Collider (LHC) energies at low pT reveal a very small thermal photon signal accompanied by considerable systematic uncertainties. Reducing these uncertainties, which arise from the π0 and η measurements as well as from the photon identification, is crucial for comparing the results with the available theoretical calculations.
To address these challenges, a novel approach employing machine learning (ML) techniques has been implemented for the classification of photons and neutral mesons. An open-source set of frameworks comprising the hipe4ml, scikit-learn, and ONNX packages is used for training, validating, and testing the model on part of the Run 2 Pb–Pb data at √sNN = 5.02 TeV collision energy.
In this talk, the performance of the novel approach is presented in comparison to the standard cut-based analysis. Initial findings employing gradient-boosted decision trees demonstrate a substantial enhancement in photon purity while preserving efficiency levels comparable to those of the standard cut-based method. Strategies for addressing highly imbalanced data sets, including techniques like feature reduction during training and the implementation of scaled penalty factors to enhance discrimination between signal and background, are also addressed. Finally, the feasibility of incorporating such ML methods into the main workflow of the direct photon analysis is also presented.
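A minimal sketch of the training-and-export pattern underlying this kind of workflow is shown below. The analysis above uses the hipe4ml/scikit-learn/ONNX stack; here, for illustration only, plain scikit-learn and skl2onnx are used, with synthetic features standing in for the real cluster observables.

```python
# Sketch only: train a gradient-boosted classifier (photon vs background) and
# export it to ONNX. Synthetic features stand in for real cluster observables;
# this is not the analysis' actual training configuration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

rng = np.random.default_rng(0)
n, n_features = 5000, 6                       # e.g. shower shape, E/p, isolation ...
X = rng.normal(size=(n, n_features))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Per-event sample weights passed to fit() are one way to handle imbalance.
clf = GradientBoostingClassifier(n_estimators=200, max_depth=3)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))

# Export to ONNX so the trained model can be evaluated inside the analysis code.
onnx_model = convert_sklearn(
    clf, initial_types=[("input", FloatTensorType([None, n_features]))])
with open("photon_classifier.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```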
Particle flow reconstruction at colliders combines various detector subsystems (typically the calorimeter and tracker) to provide a combined event interpretation that utilizes the strength of each detector. The accurate association of redundant measurements of the same particle between detectors is the key challenge in this technique. This contribution describes recent progress in the ATLAS experiment towards utilizing machine-learning to improve particle flow in the ATLAS detector. In particular, point-cloud techniques are utilized to associate measurements from the same particle, leading to reduced confusion compared to baseline techniques. Next steps towards further testing and implementation will be discussed.
Accurate modeling of backgrounds for the development of analyses requires large enough simulated samples of background data. When searching for rare processes, a large fraction of these expensively produced samples is discarded by the analysis criteria that try to isolate the rare events. At the Belle II experiment, the event generation stage takes only a small fraction of the computational cost of the whole simulation chain, motivating filters for the simulation at this stage. Deep neural network architectures based on graph neural networks have been proven useful to predict approximately which events will be kept after the filter, even in cases where there is no simple correlation between generator and reconstruction level quantities. However, training these models requires large training data sets, which are hard to obtain for filters with very low efficiencies. In this presentation we show how a generic model, pre-trained on filters with high efficiencies can be fine-tuned to also predict filters where only little training data is available. This also opens opportunities for online learning during the simulation process where no separate training step is required.
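The generic recipe behind this fine-tuning strategy can be sketched in a few lines of PyTorch: a backbone pre-trained on high-efficiency filters is frozen, and only a small head is retrained on the scarce sample available for a low-efficiency filter. The layer sizes, the checkpoint path and the data are illustrative assumptions, not the Belle II model.

```python
# Generic fine-tuning sketch (PyTorch), not the Belle II model: freeze a
# pre-trained backbone and retrain only a small head on scarce filter data.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
head = nn.Linear(64, 1)
# backbone.load_state_dict(torch.load("pretrained_backbone.pt"))  # hypothetical checkpoint

for p in backbone.parameters():
    p.requires_grad = False                     # keep pre-trained features fixed

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Tiny stand-in for the little training data available for a rare filter.
x = torch.randn(128, 16)
y = (torch.rand(128, 1) < 0.05).float()         # ~5% of events pass the filter

for epoch in range(20):
    optimizer.zero_grad()
    logits = head(backbone(x))
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()
print("final loss:", float(loss))
```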
Collaborative software and maintainability
Given the recent slowdown of Moore's Law and increasing awareness of the need for sustainable and edge computing, physicists and software developers can no longer rely on computer hardware becoming ever faster, or on moving processing to the cloud, to meet the ever-increasing computing demands of their research (e.g. the data-rate increase at the HL-LHC). However, algorithmic optimisations alone are also starting to be insufficient, so novel computing paradigms spanning both software and hardware are appearing. Adapting existing and new software to them may be difficult, though, especially for large and complex applications. This is where profiling can help bridge the gap, but finding a suitable profiler is challenging when low overhead, wide architectural support, and reliability are important.
As a response to the above problem, AdaptivePerf was developed. It is an open-source, architecture-portable, and low-overhead profiling tool with custom-patched Linux perf as its main foundation, currently available on GitHub. Thanks to the extensive research and modifications, AdaptivePerf improves the main shortcomings of perf such as incomplete stack traces. It profiles how threads and processes are created within a program and what code segments within each thread/process should be considered on- or off-CPU bottlenecks, in terms of both consumed time and other hardware metrics like cache misses. If a user-friendly visualisation is needed, AdaptivePerf can present results as a timeline with the process tree, where corresponding non-time-ordered and time-ordered flame graphs can be browsed along with functions spawning new threads/processes.
The tool has already been shown to work on x86-64 and RISC-V and is designed in the context of the SYCLOPS EU project, which CERN is part of and where solutions for heterogeneous architectures are developed, e.g. custom RISC-V cores tailored to a specific problem, RISC-V support for SYCL, and SYCL-accelerated algorithms in ROOT. In this presentation, we will talk about the profiler, its place within the project, and how it can be used for software-hardware co-design for HEP.
The software framework of the Large Hadron Collider Beauty (LHCb) experiment, Gaudi, heavily relies on the ROOT framework and its I/O subsystems for data persistence mechanisms. Gaudi internally leverages the ROOT TTree data format, as it is currently used in production by LHC experiments. However, with the introduction and scaling of multi-threaded capabilities within Gaudi, the limitations of TTree as a data storage backend have become increasingly apparent, marking it as a non-negligible bottleneck in data processing workflows.
The following work introduces a comprehensive two-part enhancement to Gaudi to address this challenge. An initial focus is given to optimizing the current n-tuple writing infrastructure to be thread-safe within the constraints of the existing TTree backend, thus maintaining compatibility for users and downstream applications. This phase is then followed by the migration of the n-tuple storage backend from TTree to RNTuple, ROOT's next-generation I/O subsystem for physics data storage. This migration aims at leveraging the thread-safe, asynchronous capabilities of the new data format, thus making Gaudi fit to handle the requirements of HL-LHC computing and beyond.
Keywords: LHCb; Gaudi; ROOT; TTree; RNTuple; thread-safety
A data quality assurance (QA) framework is being developed for the CBM experiment. It provides flexible tools for monitoring reference quantity distributions for different detector subsystems and data reconstruction algorithms. This helps to identify software malfunctions and the calibration status, to prepare a setup for data taking, and to prepare data for production. The modular structure of the QA framework makes it possible to keep independent QA units for the different steps of the data reconstruction.
Since the offline and online data reconstruction scenarios must meet different requirements, the QA framework is implemented differently for the two regimes. In the offline scenario, the data QA software is based on the FairRoot framework and is used to track the effects on data of the continuous development of the reconstruction algorithms, as well as to check the data quality at the production stage. The QA software for the online reconstruction scenario uses the standard C++ and Boost libraries and provides real-time monitoring of detector and algorithm performance. It was successfully applied to data taking at the mini-CBM experiment in May 2024.
The LHCb high-level trigger applications consist of components that run reconstruction algorithms and perform physics-object selections, scaling from hundreds to tens of thousands of components depending on the selection stage. The configuration of the components, the data flow and the control flow are implemented in Python. The resulting application configuration is condensed into the basic form of a list of components with their properties and values.
It is often necessary to change the configuration without deploying new binaries. Moreover, it is essential to be able to reproduce a given production configuration and to query it after it has been used. For these reasons, the basic form of the trigger configuration is captured and stored in a Git database.
This contribution describes the infrastructure for generating and validating the configurations. The process is based on GitLab pipelines that are triggered on user-defined specifications and run several steps, ranging from basic checks to performance validation using dedicated runners. Upon merging, the configuration database is deployed on CVMFS. The process as implemented ensures consistency and reproducibility across all selection stages.
This project also aims to take advantage of the query-able nature of the configurations by creating an API that allows probing a single configuration in detail. This is further used to create human-readable summaries and to track changes across configurations to help analysts understand the selections used to collect their datasets.
At the core of CERN's mission lies a profound dedication to open science, a principle that has fueled decades of ground-breaking collaborations and discoveries. This presentation introduces an ambitious initiative: a comprehensive catalogue of CERN's open-source projects, curated by CERN's own Open Source Program Office (OSPO). The mission? To spotlight every flagship and nascent project under the CERN umbrella, making them accessible and known to the world.
This catalogue is a testament to CERN's commitment to open science and a tool to highlight all the pros of open source, foster collaboration, and stimulate innovation across the global scientific community. By curating this catalogue, the OSPO aims to not only showcase the breadth and depth of CERN's contributions to open-source software, but also to pave the way for engagement with researchers, external developers, and different institutions.
Discover how we're making open-source projects at CERN visible and why this matters for the future of scientific research. From technical challenges and solutions, to the strategic importance of open source in pushing HE(N)P discoveries forward, the journey so far has been filled with insights and stories that echo the essence that pushes for innovation at CERN. This is not just about showcasing projects; it's about building bridges in the open-source community and contributing to a legacy of open science.
Computing Infrastructure
The ePIC collaboration is working towards the realization of the first detector at the upcoming Electron-Ion Collider. As part of our computing strategy, we have settled on containers for the distribution of our modular software stacks using spack as the package manager. Based on abstract definitions of multiple mutually consistent software environments, we build dedicated containers on each commit of every pull request for the software projects under our purview. This is only possible through judicious caching from container layers, over downloaded artifacts and binary builds, down to individual compiled files. These containers are subsequently used for our benchmark and validation workflows. Our container build infrastructure runs with redundancy between GitHub and self-hosted GitLab resources, and can take advantage of cloud-based resources in periods of peak demand. In this talk, I will discuss our experiences with newer features of spack, including storing build products as OCI layers and inheritance of previously concretized environments for software stack layering.
The economies of scale realised by institutional and commercial cloud providers make such resources increasingly attractive for grid computing. We describe an implementation of this approach which has been deployed for Australia's ATLAS and Belle II grid sites.
The sites are built entirely with Virtual Machines (VM) orchestrated by an OpenStack [1] instance. The Storage Element (SE) utilises an xrootd-s3 gateway [2][3] with back-end storage provided through an S3-compatible object store from a commercial provider. The provisioning arrangements required the deployment of some site-specific helper modules to ensure all SE interfacing requirements could be met. OpenStack hosts the xrootd redirector and proxy servers in separate VMs.
The Compute Element (CE) comprises virtual machines (VMs) within the OpenStack instance. Jobs are submitted and managed by HTCondor [4]. A CloudScheduler [5][6] instance is used to coordinate the number of active OpenStack VMs and to ensure that VMs run only when there are jobs to run.
Automated configuration of the individual VMs associated with the grid sites is managed using Ansible [7]. This approach was chosen due to its low overheads and the simplicity of deployment.
Performance metrics of the resulting grid sites will be presented to illustrate the viability of this cost-effective approach to resource provisioning for grid computing.
[1] OpenStack: https://www.openstack.org/
[2] Xrootd: https://xrootd.slac.stanford.edu/
[3] Andrew Hanushevsky and Wei Yang: "Xrootd S3 Gateway for WLCG Storage", 26th International Conference on Computing in High Energy & Nuclear Physics (CHEP 2023), https://doi.org/10.1051/epjconf/202429501057
[4] HTCondor: https://htcondor.org/htcondor/overview/
[5] CloudScheduler: https://github.com/hep-gc/cloudscheduler
[6] Randall Sobie, F. Berghaus, K. Casteels, C. Driemel, M. Ebert, F. F. Galindo, C. Leavett-Brown, D. MacDonell, M. Paterson, R. Seuster, S. Tolkamp, J. Weldon: "cloudScheduler a VM provisioning system for a distributed compute cloud", 24th International Conference on Computing in High-Energy and Nuclear Physics (CHEP 2019), https://doi.org/10.1051/epjconf/202024507031
[7] Ansible: https://www.ansible.com/
A large fraction of computing workloads in high-energy and nuclear physics is executed using software containers. For physics analysis use, such container images often have sizes of several gigabytes. Executing a large number of such jobs in parallel on different compute nodes efficiently demands the availability and use of caching mechanisms and image-loading techniques to prevent network saturation and significantly reduce startup time. Using the industry-standard containerd container runtime for pulling and running containers enables the use of various so-called snapshotter plugins that "lazily" load container images. We present a quantitative comparison of the performance of the CVMFS, SOCI, and Stargz snapshotter plugins. Furthermore, we also evaluate the user-friendliness of such approaches and discuss how such seamlessly containerised workloads contribute to the reusability and reproducibility of physics analyses.
In recent years, the CMS experiment has expanded the usage of HPC systems for data processing and simulation activities. These resources significantly extend the conventional pledged Grid compute capacity. Within the EuroHPC program, CMS applied for a "Benchmark Access" grant at VEGA in Slovenia, an HPC centre that is being used very successfully by the ATLAS experiment. For CMS, VEGA was integrated transparently as a sub-site extension to the Italian Tier-1 site at CNAF. In this first approach, only CPU resources were used, while all storage access was handled via CNAF through the network. Extending Grid sites with HPC resources was an established concept for CMS; however, in this project HPC resources located in a different country from the Grid site were integrated for the first time. CMS used the allocation primarily to validate a recent CMSSW release regarding its readiness for GPU usage. Prior developments in the CMS workload management system that allow the targeting of GPU resources in the distributed infrastructure turned out to be instrumental, and jobs could be submitted like any other release-validation workflow. The presentation will detail aspects of the actual integration, some required tuning to achieve reasonable GPU utilisation, and an assessment of operational parameters such as error rates compared to traditional Grid sites.
The Italian National Institute for Nuclear Physics (INFN) has recently developed a national cloud platform to enhance access to distributed computing and storage resources for scientific researchers. A critical aspect of this initiative is the INFN Cloud Dashboard, a user-friendly web portal that allows users to request high-level services on demand, such as Jupyter Hub, Kubernetes, and Spark clusters.
The platform is based on INDIGO-PaaS middleware, which integrates a TOSCA-based orchestration system. This system supports a lightweight federation of cloud sites and automates resource scheduling for optimal resource allocation.
Through the internal INFN DataCloud project and European initiatives like interTwin, INFN is undertaking a comprehensive overhaul of its PaaS system to adapt to evolving technologies and replace outdated software components. To further improve the orchestration system, INFN is exploring the use of artificial intelligence to enhance deployment scheduling.
Additionally, the dashboard, serving as a user interface for orchestrating and deploying services, has recently undergone significant renovations to boost usability and security. This contribution aims to highlight key advancements in the PaaS orchestration system designed to offer a reliable, scalable, and user-friendly environment for the computational needs of the scientific community.
Norwegian contributions to the WLCG consist of computing and storage resources in Bergen and Oslo for the ALICE and ATLAS experiments. The increasing scale and complexity of Grid site infrastructure and operation require integration of national WLCG resources into bigger shared installations. Traditional HPC resources often come with restrictions with respect to software, administration, and accessibility. Furthermore, expensive HPC infrastructure like fast interconnects is hardly used by grid workload.
As a cost-efficient solution, the Norwegian Grid resources are operated as two platforms within NREC, the Norwegian Research and Education Cloud, which is a cloud computing service operated by the Universities of Oslo and Bergen. It aims to provide easily accessible computing and storage infrastructure for national academic and scientific applications.
By using cloud technology instead of traditional HPC resources, WLCG installations benefit from a high degree of accessibility, flexibility, and scalability while the service provider ensures reliable and secure operation of infrastructure and network.
Orchestration of the virtual instances is based on the Infrastructure-as-a-service paradigm and implemented as declarative configuration files in Terraform. All custom host configuration, software deployment and cluster configuration are implemented as YAML code and deployed using Ansible.
This concept allows for the delivery of high-quality WLCG services with key features such as: fixed and opportunistic computing resources; ARC and JAliEn grid middleware; Slurm and HTCondor backend; CEPH disk storage integrated into Neic NDGF dCache; integrated tape storage; monitoring and alerting based on Prometheus/Grafana ecosystem; fully controlled setup by site admin; scalable extension; quick failover and recovery.
This presentation describes the capabilities of the Norwegian Research and Education Cloud and the strategy for provisioning of Grid computing and storage using the IaaS approach. Details on cluster management and monitoring as a service, flexible cluster orchestration, scalability and performance studies will be highlighted in the presentation.
Collaboration, Reinterpretation, Outreach and Education
CERN openlab is a unique resource within CERN that works to establish strategic collaborations with industry, fuel technological innovation and expose novel technologies to the scientific community.
ICT innovation is needed to deal with the unprecedented levels of data volume and complexity generated by the High Luminosity LHC. The current CERN openlab Phase VIII is designed to tackle these challenges on a number of fronts, including, but not limited to: heterogeneous computing, platforms, and infrastructures; novel storage, compression, and data management solutions; emerging low-latency interconnect and link protocols; and the exploitation of artificial intelligence algorithms across a multitude of domains, including edge devices for real-time event selection and triggering. The evaluation and adoption of these technologies are being accelerated by ongoing collaborations between industrial leaders in the relevant fields and the scientific community at CERN. The work of ongoing focussed projects in these areas will be summarised, and results demonstrating their impact will be shown. Incubator projects on emerging technologies such as digital twins and generative AI will be presented, as well as the next steps in these R&D efforts.
GlideinWMS is a workload manager provisioning resources for many experiments, including CMS and DUNE. The software is distributed both as native packages and as specialized production containers. Following an approach used in other communities like web development, we built our workspaces: system-like containers that ease development and testing.
Developers can change the source tree or check out a different branch and quickly reconfigure the services to see the effect of their changes.
In this paper, we'll talk about what differentiates workspaces from other containers.
We'll describe our base system, composed of three containers: a one-node cluster including a compute element and a batch system; a GlideinWMS Factory controlling pilot jobs; and a scheduler and Frontend to submit jobs and provision resources. Additional containers can be used for optional components. This system can easily run on a laptop, and we'll share our evaluation of different container runtimes, with an eye to ease of use and performance.
Finally, we'll talk about our experience as developers and with students.
The GlideinWMS workspaces are easily integrated with IDEs like VS Code, simplifying debugging and allowing development and testing of the system even when offline.
They simplified the training and onboarding of new team members and Summer interns.
And they were useful in workshops where students could have first-hand experience with the mechanisms and components that, in production, run millions of jobs.
Virtual Reality (VR) applications play an important role in HEP Outreach & Education. They make it possible to organize virtual tours of the experimental infrastructure by virtually interacting with detector facilities, describing their purpose and functionalities. However, today's VR applications require expensive hardware, such as an Oculus headset or Microsoft HoloLens, and powerful computers. This limits the reach of VR applications and makes their benefits questionable. An important improvement to VR development is thus to facilitate the usage of inexpensive hardware, like Google Cardboard and phones with average computational power.
The requirement to use inexpensive hardware while achieving quality and performance close to that of advanced hardware poses challenges for VR application developers. One of these challenges concerns the geometry of the 3D VR scenes. Geometry determines the quality of the 3D scenes and at the same time places a heavy load on the GPU. Methods for preparing the geometry must therefore strike a good balance between the quality and performance of the VR applications.
The paper describes methods for simplifying the "as-built" geometry descriptions and ways to reduce the number of facets to meet GPU performance limitations and ensure smooth movement in the VR scenes.
With the onset of ever more data collected by the experiments at the LHC and the increasing complexity of the analysis workflows themselves, there is a need to ensure the scalability of a physics data analysis. Logical parts of an analysis should be well separated - the analysis should be modularized. Where possible, these different parts should be maintained and reused for other analyses or reinterpretation of the same analysis.
Also, having an analysis prepared in such a way helps to ensure its reproducibility and preservation in the context of good data and analysis code management practices following the FAIR principles. In this talk, a few different topics on analysis modularization are discussed. An analysis on searches for pentaquarks within the LHCb experiment at CERN is used as an example.
Data Preservation (DP) is a mandatory specification for any present and future experimental facility, and it is a cost-effective way of doing fundamental research by exploiting unique data sets in the light of ever-increasing theoretical understanding. When properly taken into account, DP leads to a significant increase in scientific output (typically 10%) for a minimal investment overhead (0.1%). DP relies on and stimulates cutting-edge technology developments and is strongly linked to the Open Science and FAIR data paradigms. A recently released report (Eur. Phys. J. C 83 (2023) 9, 795, arXiv:2302.03583 [hep-ex]) summarizes the status of data preservation in high energy physics from the perspective of more than ten years of experience with a structured effort at the international level (DPHEP).
Collaborative software development for particle physics experiments demands rigorous code review processes to ensure maintainability, reliability, and efficiency. This work explores the integration of Large Language Models (LLMs) into the code review process, with a focus on utilizing both commercial and open models. We present a comprehensive code review workflow that incorporates LLMs, integrating various enhancements such as multi-agent capabilities and reflection. Furthermore, tools are employed to facilitate the verification of suggested code changes before presentation in the review. By harnessing the capabilities of LLMs, the review process can uncover faults and identify improvements that traditional automated analysis tools may overlook. This integration shows promise for improving code quality, reducing errors, and fostering collaboration among developers in the field of particle physics software development.
The sheer volume of data generated by LHC experiments presents a computational challenge, necessitating robust infrastructure for storage, processing, and analysis. The Worldwide LHC Computing Grid (WLCG) addresses this challenge by integrating global computing resources into a cohesive entity. To cope with changes in the infrastructure and increased demands, the compute model needs to be adapted. Simulations of different compute models present a feasible approach for evaluating different design candidates. However, running these simulations incurs a trade-off between accuracy and scalability. For example, while the simulator DCSim can provide accurate results, it falls short on scalability when increasing the size of the simulated platform. Generative Machine Learning as a surrogate is successfully used to overcome these limitations in other domains that exhibit similar trade-offs between scalability and accuracy, such as the simulation of detectors.
In our work, we evaluate the usage of three different machine learning models as surrogate models for the simulation of distributed computing systems and assess their ability to generalize to unseen jobs and platforms. We show that these models can predict the simulated platforms' main observables, derived from the execution traces of compute jobs, with reasonable accuracy. Potential for further improving the predictions lies in using other machine learning models and different encodings of the platform-specific information to achieve better generalizability for unseen platforms.
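To illustrate the surrogate idea, the sketch below trains a simple gradient-boosted regression model that maps job and platform descriptors to one simulated observable; the feature names, CSV layout, and hyperparameters are hypothetical and not those used in the study.

```python
# Illustrative sketch only: a gradient-boosted regression surrogate that maps
# job and platform descriptors to a simulated observable (here, job wall time).
# Feature names and the CSV layout are hypothetical, not the paper's actual setup.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

df = pd.read_csv("simulated_traces.csv")          # hypothetical training data
features = ["n_events", "input_size_gb", "cores", "link_bandwidth_gbps"]
target = "walltime_s"

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=42
)

surrogate = GradientBoostingRegressor(n_estimators=300, max_depth=4)
surrogate.fit(X_train, y_train)

# Evaluate how well the surrogate reproduces the full simulation on unseen jobs.
mape = mean_absolute_percentage_error(y_test, surrogate.predict(X_test))
print(f"surrogate MAPE on held-out jobs: {mape:.2%}")
```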
In the ATLAS analysis model, users must interact with specialized algorithms to perform a variety of tasks on their physics objects, including calibration, identification, and obtaining systematic uncertainties for simulated events. These algorithms have a wide variety of configurations and often must be applied in a specific order. A user-friendly configuration mechanism has been developed with the goal of improving the user experience in terms of both ease of use and stability. Users can now configure the necessary algorithms via a YAML file, enabled by a physics-oriented Python configuration. The configuration mechanism and training will be discussed.
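The following schematic sketch conveys the flavor of a YAML-driven configuration that builds an ordered sequence of algorithms; the block names, options, and schema are purely illustrative and not the actual ATLAS configuration format.

```python
# A schematic illustration (not the actual ATLAS schema) of how a YAML file
# could drive an ordered sequence of physics-object algorithms. Block names
# such as "Jets" and options such as "calibrate" are purely hypothetical.
import yaml

CONFIG = """
sequence:
  - block: Muons
    options: {workingPoint: Medium, systematics: true}
  - block: Jets
    options: {collection: AntiKt4EMPFlowJets, calibrate: true}
  - block: MissingET
    options: {}
"""

def build_sequence(text: str):
    """Turn the YAML description into an ordered list of (block, options) pairs."""
    cfg = yaml.safe_load(text)
    return [(step["block"], step.get("options", {})) for step in cfg["sequence"]]

for name, opts in build_sequence(CONFIG):
    print(f"scheduling {name} with {opts}")   # stand-in for real algorithm creation
```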
We explore applications of quantum graph neural networks (QGNN) on physics and non-physics datasets. Based on a single quantum circuit architecture, we perform node-, edge-, and graph-level prediction tasks. Our main example is particle trajectory reconstruction starting from a set of detector data. Along with this, we expand our analysis to an artificial helical-trajectory dataset. Finally, we check how our quantum algorithm applies to non-physics data by looking at the Fingerprint, MUTAG, and AIDS datasets, which collect molecular compound graphs, focusing on graph-level tasks.
The ATLAS experiment involves over 6000 active members, including students, physicists, engineers, and researchers, and more than 2500 members are authors. This dynamic CERN environment brings up some challenges, such as managing the qualification status of each author. The Qualification system, developed by the Glance team, aims to automate the processes required for monitoring the progress of ATLAS members as they work to achieve author status. Recently, ATLAS modified the policy governing authorship qualification, and updates were necessary to put the changes into effect.
The system’s code was originally developed on top of an outdated framework. In order to ease the transition to the new ATLAS authorship qualification policy, the code was updated to a Hexagonal architecture based on the Domain-Driven Design philosophy. Database access has shifted from an ORM (Object Relational Mapper) to SQL repositories to align with the team’s development stack. The system's quality is ensured by automatic tests as part of an effective refactoring process that is transparent to the end user. This refactoring strategy enhances our system to meet both previously unaddressed and new requirements, to improve code maintainability, and to increase flexibility to accommodate possible future changes in the qualification policy.
The software of the ATLAS experiment at the CERN LHC accelerator contains a number of tools to analyze (validate, summarize, peek into, etc.) all its official data formats recorded in ROOT files. These tools - mainly written in the Python programming language - handle the ROOT TTree, which is currently the main storage object format of ROOT files. However, the ROOT project has developed an alternative to TTree, called RNTuple. The new storage format offers significant improvements, and ATLAS plans to adopt it in LHC Run 4. Work is ongoing to enhance the tools so that they handle the RNTuple storage format in addition to TTree in a way that is transparent to the user. The work is aided by the modern and detailed APIs provided by RNTuple. We will present the progress made and lessons learnt.
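One way such tools can stay transparent to the user is to dispatch on the stored class name of each object; the sketch below illustrates this idea with uproot, under the assumption that the RNTuple anchor appears as either "ROOT::RNTuple" or "ROOT::Experimental::RNTuple" depending on the ROOT version, and with stub handlers in place of the real tooling.

```python
# Sketch of the dispatch idea only: peek at the class name of each object in a
# ROOT file and route it to a TTree or an RNTuple handler. Whether the stored
# class appears as "ROOT::RNTuple" or "ROOT::Experimental::RNTuple" depends on
# the ROOT version, so both are checked here; the handlers themselves are stubs.
import uproot

def handle_ttree(tree) -> None:
    print(f"TTree with {tree.num_entries} entries and {len(tree.keys())} branches")

def handle_rntuple(key: str) -> None:
    # Placeholder: real tooling would open the RNTuple via its reader API.
    print(f"RNTuple found under key {key}")

def summarize(path: str) -> None:
    with uproot.open(path) as f:
        for key, classname in f.classnames().items():
            if classname == "TTree":
                handle_ttree(f[key])
            elif classname in ("ROOT::RNTuple", "ROOT::Experimental::RNTuple"):
                handle_rntuple(key)
            else:
                print(f"{key}: skipping object of type {classname}")

if __name__ == "__main__":
    summarize("DAOD_PHYS.example.root")   # hypothetical file name
```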
The ATLAS Tile Calorimeter (TileCal) is the central hadronic calorimeter of the ATLAS detector at the Large Hadron Collider at CERN. It plays an important role in the reconstruction of jets, hadronically decaying tau leptons and missing transverse energy, and also provides information to the dedicated calorimeter trigger. The TileCal readout is segmented into nearly 10000 channels that are calibrated using the dedicated calibration systems such as laser, charge injection, integrator and Cesium source.
Data quality assurance is paramount, with collision and calibration data subject to rigorous scrutiny. Automated checks are performed on predefined histograms, and the results are summarized on dedicated web pages. Operators use a suite of tools to further inspect the data and identify any issues or irregularities. The TileCal conditions data, including calibration constants and channel statuses, are therefore regularly updated in databases. These databases are used for data reprocessing and are also crucial for maintenance work during the technical stops.
In this talk, we will discuss the software tools used for data quality monitoring, emphasizing recent advancements and our pursuit of consolidating multiple tools into a more streamlined web application. Our overarching goal is to optimize the efficiency of the shifters responsible for monitoring data quality while simultaneously simplifying the entire process.
The distributed computing of the ATLAS experiment at the Large Hadron Collider (LHC) utilizes computing resources provided by the Czech national High Performance Computing (HPC) center, IT4Innovations. This is done through ARC-CEs deployed at the Czech Tier2 site, praguelcg2. Over the years, this system has undergone continuous evolution, marked by recent enhancements aimed at improving resource utilization efficiency.
One key enhancement involves the implementation of the HyperQueue meta-scheduler. It enables a division of whole-node jobs into several smaller, albeit longer, jobs, thereby enhancing CPU efficiency. Additionally, the integration of cvmfsexec enables access to the distributed CVMFS filesystem on compute nodes without requiring any special configurations, thereby substantially simplifying software distribution and broadening the range of tasks eligible for execution on the HPC. Another notable change was the migration of the batch system from PBSpro to Slurm.
Data processing and analysis is one of the main challenges at HEP experiments; typically, a single physics result can take more than three years to produce. To accelerate physics analysis and drive new physics discoveries, the rapidly developing Large Language Models (LLMs) are a most promising approach: they have demonstrated astonishing capabilities in the recognition and generation of text, from which most parts of a physics analysis can benefit. In this talk we will discuss the construction of a dedicated intelligent agent, an AI assistant at BESIII based on LLMs, its potential to boost hadron spectroscopy studies, and the future plan towards an AI scientist.
The huge volume of data generated by scientific facilities such as EuXFEL or the LHC places immense strain on the data management infrastructure within laboratories. This includes poorly shareable archival storage resources, typically tape libraries. Maximising the efficiency of these tape resources necessitates a deep integration between hardware and software components.
CERN's Tape Archive (CTA) is an open-source storage management system developed by CERN to handle LHC data on tape. Although the primary target of CTA is CERN Tier-0, the Data Management Group considers CTA a compelling alternative to commercial Hierarchical Storage Management (HSM) systems.
dCache, with its adaptable tape interface, allows connectivity to any tape system. Collaborating closely with the CERN Tape Archive team, we have been working on the seamless integration of CTA into the dCache ecosystem.
This work shows the design, current progress, and initial deployment experiences of the dCache-CTA integration at DESY.
Monitoring the status of a high throughput computing cluster running computationally intensive production jobs is a crucial yet challenging system administration task due to the complexity of such systems. To this end, we train autoencoders using the Linux kernel CPU metrics of the cluster. Additionally, we explore assisting these models with graph neural networks to share information across threads within a compute node. The models are compared in terms of their ability to: 1) Produce a compressed latent representation that captures the salient features of the input, 2) Detect anomalous activity, and 3) Distinguish between different kinds of jobs run at Jefferson Lab. The goal is to have a robust encoder whose compressed embeddings are used for several downstream tasks. We extend this study further by deploying these models in a human-in-the-loop production-based setting for the anomaly detection task and discuss the associated implementation aspects such as continual learning and the criterion to generate alarms. This study represents a first step in the endeavor towards building self-supervised large-scale foundation models for computing centers.
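A minimal PyTorch sketch of the general approach (not the Jefferson Lab models) is shown below: an autoencoder is trained on per-node CPU-metric vectors and the reconstruction error is later used as an anomaly score; the input dimension, training loop, and alarm criterion are hypothetical.

```python
# Minimal PyTorch sketch of the general idea (not the Jefferson Lab models):
# an autoencoder is trained on CPU-metric vectors, and a large reconstruction
# error is later used as an anomaly score. Dimensions and threshold are hypothetical.
import torch
import torch.nn as nn

class MetricAutoencoder(nn.Module):
    def __init__(self, n_metrics: int = 64, latent: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_metrics, 32), nn.ReLU(),
                                     nn.Linear(32, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(),
                                     nn.Linear(32, n_metrics))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = MetricAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

normal_batch = torch.randn(256, 64)          # stand-in for "healthy" node metrics
for _ in range(100):                          # toy training loop
    optimizer.zero_grad()
    loss = loss_fn(model(normal_batch), normal_batch)
    loss.backward()
    optimizer.step()

# Anomaly score: per-sample reconstruction error on new data.
new_batch = torch.randn(32, 64)
scores = ((model(new_batch) - new_batch) ** 2).mean(dim=1)
alarms = scores > 2.0 * scores.median()       # hypothetical alarm criterion
```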
Coprocessors, especially GPUs, will be a vital ingredient of data production workflows at the HL-LHC. At CMS, the GPU-as-a-service approach for production workflows is implemented by the SONIC project (Services for Optimized Network Inference on Coprocessors). SONIC provides a mechanism for outsourcing computationally demanding algorithms, such as neural network inference, to remote servers, where requests from multiple clients are intelligently distributed across multiple GPUs by a load-balancing service. This talk highlights the recent progress in deploying SONIC at selected U.S. CMS Tier-2 data centers. Using realistic CMS Run3 data processing workflows, such as those containing transformer-based algorithms, we demonstrate how SONIC is integrated into the production-like environment to enable accelerated inference offloading. We will present developments from both the client and server sides, including production job and data center configurations for NVIDIA and AMD GPUs. We will also present performance scaling benchmarks and discuss the challenges of operating SONIC in CMS production, such as server discovery, GPU saturation, fallback server logic, etc.
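For context, the request pattern that such inference offloading relies on can be sketched in a few lines with NVIDIA's Triton gRPC client; this is not SONIC itself, and the server URL, model name, and tensor names are hypothetical placeholders.

```python
# Not SONIC itself: a plain-Python sketch of the request pattern such offloading
# relies on, using NVIDIA's Triton gRPC client. The server URL, model name, and
# tensor names ("INPUT__0"/"OUTPUT__0") are hypothetical placeholders.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="triton.example.org:8001")

batch = np.random.rand(128, 50).astype(np.float32)   # stand-in event features
inp = grpcclient.InferInput("INPUT__0", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)
out = grpcclient.InferRequestedOutput("OUTPUT__0")

# The load balancer in front of the GPUs decides which device actually serves this.
result = client.infer(model_name="particlenet_ak4", inputs=[inp], outputs=[out])
scores = result.as_numpy("OUTPUT__0")
print(scores.shape)
```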
The event builder in the Data Acquisition System (DAQ) of the CMS experiment at the CERN Large Hadron Collider (LHC) is responsible for assembling events at a rate of 100 kHz during the current LHC Run 3, and 750 kHz for the upcoming High-Luminosity LHC, scheduled to start in 2029. Both the current and future DAQ architectures leverage state-of-the-art network technologies, employing Ethernet switches capable of supporting RoCE protocols. The DAQ front-end hardware is custom-designed, utilizing a reduced TCP/IP protocol implemented in FPGA for reliable data transport between custom electronics and commercial computing hardware.
An alternative architecture for the event builder, known as the File-based Event Builder (FEVB), is under evaluation. The FEVB comprises two separate systems: the Super-Fragment Builder (SFB) and the Builder File-based Filter Farm (BF3).
A super-fragment consists of the event data read by one or more Front-End Drivers and corresponding to the same L1 accept, and the SFB constructs multiple super-fragments corresponding to the number of Read-Unit (RU) machines in the DAQ system, storing them in local RAM disks. Subsequently, the BF3 accesses super-fragments from all RU machines via the Network File System (NFS) over Ethernet and builds complete events within the High Level Trigger process.
This paper describes the first prototype of the FEVB and presents preliminary performance results obtained within the DAQ system for LHC Run 3.
The LHCb experiment employs GPU cards in its first-level trigger system to enhance computing efficiency, handling a data rate of 40 Tb/s from the detector. GPUs were selected for their computational power, parallel processing capabilities, and adaptability.
However, trigger tasks necessitate extensive combinatorial and bitwise operations, ideally suited for FPGA implementation. Yet FPGA adoption for compute acceleration is hindered by steep learning curves and very different programming paradigms with respect to GPUs and CPUs. In the last few years, interest in high-level synthesis has grown because of the possibility of developing FPGA gateware in higher-level languages.
This study assesses the Intel® oneAPI FPGA Toolkit, which aims to simplify the development of FPGA-accelerated workloads by offering a GPU-like programming framework. We detail the integration of a portion of the current pixel clustering algorithm into oneAPI, address common implementation challenges, and compare it against CPU, GPU, and RTL implementations.
Our findings showcase promising outcomes for this emerging technology, potentially facilitating the repurposing of FPGAs in the data acquisition system as compute accelerators during idle data-taking periods.
Computing Centers always look for new server systems that can reduce operational costs, especially power consumption, and provide higher performance.
ARM-CPUs promise higher energy efficiency than x86-CPUs.
Therefore, the WLCG Tier1 center GridKa will partially use worker nodes with ARM-CPUs and has already carried out various power consumption and performance tests based on the HEPScore23 benchmark.
Various system settings, such as maximum CPU frequency, were studied to determine the best performance and highest energy efficiency of the ARM-CPU systems.
GridKa will provide the HEP community with several ARM-CPU worker nodes in their batch farm.
We present the results of these benchmarks on systems with ARM-CPUs compared to benchmarks of current x86-CPU worker nodes at GridKa and the status of provisioning ARM-CPU worker nodes to the community.
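As a purely illustrative example of the figure of merit involved (the numbers below are invented, not GridKa measurements), a HEPScore23-per-watt value can be computed for each tested frequency setting:

```python
# Purely illustrative numbers (not GridKa measurements): compute a
# HEPScore23-per-watt figure of merit for a few CPU frequency settings
# to pick the most energy-efficient configuration.
measurements = [
    # (max CPU frequency in GHz, HEPScore23, average node power in W)
    (2.6, 1450.0, 520.0),
    (3.0, 1580.0, 610.0),
    (3.4, 1650.0, 720.0),
]

for freq, score, power in measurements:
    print(f"{freq:.1f} GHz: {score / power:.2f} HEPScore23 per watt")

best = max(measurements, key=lambda m: m[1] / m[2])
print(f"most energy-efficient setting: {best[0]:.1f} GHz")
```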
Dirac, a versatile grid middleware framework, is pivotal in managing computational tasks and workflows across a spectrum of scientific research domains including high energy physics and astrophysics. Historically, Dirac has employed specialized descriptive languages that, while effective, have introduced significant complexities and barriers to workflow interoperability and reproducibility. These challenges have become particularly pressing in light of the reproducibility crisis - an ongoing and pervasive issue that surfaced prominently in the early 2010s, marked by difficulties in replicating scientific results across different studies.
In response to these challenges, the integration of the Common Workflow Language (CWL) into Dirac represents a transformative development. CWL is a specification dedicated to the unambiguous definition and execution of computational workflows, facilitating their shareability and reusability across diverse computing environments. Its adoption within Dirac aims to standardize the description of computational tasks, thereby enhancing both reproducibility and interoperability.
By streamlining the interface for defining computational tasks within Dirac, we enable researchers to effortlessly transition workflows from local to grid-scale environments and foster compatibility with a broader ecosystem of scientific tools. This integration promises not only to mitigate the challenges posed by the reproducibility crisis but also to significantly lower the threshold for engaging with complex computational infrastructures, thus accelerating scientific discovery and innovation across multiple disciplines.
CERN has a huge demand for computing services. To accommodate this demand, a highly scalable and highly dense infrastructure is necessary.
To accomplish this, CERN adopted Kubernetes, an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications.
This session will discuss the strategies and tooling used to simplify the use of Kubernetes, in particular the following (a minimal illustrative sketch is given after the list):
- one-click deployment of any application from a git repository
- zero-config creation of CD pipelines
- specialized managed clusters for common use cases
- dashboards to manage deployments across different clusters
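The toy sketch below illustrates the "deploy from a git repository" idea using the official Kubernetes Python client; it is not the CERN tooling, and the repository URL, manifest layout, and credentials handling are hypothetical.

```python
# Toy illustration of the "deploy from a git repository" idea, not CERN's tooling:
# clone a repo and apply every manifest in its manifests/ directory using the
# official Kubernetes Python client. Repository URL and paths are hypothetical.
import glob
import subprocess
import tempfile

from kubernetes import client, config, utils

def deploy_from_git(repo_url: str, manifest_dir: str = "manifests") -> None:
    config.load_kube_config()                 # use the local kubeconfig credentials
    api = client.ApiClient()
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(["git", "clone", "--depth", "1", repo_url, workdir], check=True)
        for manifest in sorted(glob.glob(f"{workdir}/{manifest_dir}/*.yaml")):
            utils.create_from_yaml(api, manifest)
            print(f"applied {manifest}")

if __name__ == "__main__":
    deploy_from_git("https://gitlab.cern.ch/example/my-app.git")   # hypothetical repo
```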
Over the past years, the ROOT team has been developing a new I/O format called RNTuple to store data from experiments at CERN's Large Hadron Collider. RNTuple is designed to improve on ROOT's existing TTree I/O subsystem, increasing I/O speed and introducing a more efficient binary data format. It can be stored in both ROOT files and object stores, and it is optimized for modern storage hardware like NVMe SSDs.
The ATLAS experiment plans to use RNTuple as its primary storage container in the upcoming HL-LHC.
There's been significant progress in integrating RNTuple into the ATLAS event processing framework, and now all production ATLAS data output formats support it. Performance studies with open-source data have shown substantial improvements in space resource usage. The reported study examines the I/O throughput and disk-space savings achieved with RNTuple for various ATLAS data output formats, including RDO, ESD, AOD, and various DAOD. These measurements will have an important impact on the computing resource needs of the ATLAS experiment for HL-LHC operation.
In this study, we introduce the JIRIAF (JLAB Integrated Research Infrastructure Across Facilities) system, an innovative prototype of an operational, flexible, and widely distributed computing cluster, leveraging readily available resources from Department of Energy (DOE) computing facilities. JIRIAF employs a customized Kubernetes orchestration system designed to integrate geographically dispersed resources into a unified, elastic distributed cluster. This system operates without the need for additional infrastructure investments by resource providers. Notably, JIRIAF has demonstrated a capability to process data streams at rates up to 100 Gbps, facilitating real-time data-stream processing across vast distances.
Furthermore, we developed a digital representation of workflows using a Bayesian probability graph model. This model utilizes a standard joint probability distribution to represent various probabilities associated with the digital state, including relevant quantities and potential rewards, all derived from observed actions and data. The determination of these quantities and rewards employs queueing theory, focusing on two critical metrics: the rate of workflow input and the processing rate. Our results confirm the efficacy of the JIRIAF digital twin in managing and orchestrating highly distributed workflows, showcasing its potential to significantly enhance computational resource utilization and process efficiency in complex environments.
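As a purely illustrative aside (not the JIRIAF digital-twin model itself), the two metrics highlighted above, the workflow arrival rate and the processing rate, already determine simple steady-state quantities in a textbook M/M/1 queue:

```python
# Not the JIRIAF digital-twin model itself: a textbook M/M/1 queueing sketch of
# the two metrics the abstract highlights, the workflow arrival rate and the
# processing rate, and the quantities one can derive from them.
def mm1_summary(arrival_rate: float, processing_rate: float) -> dict:
    """Return basic steady-state M/M/1 quantities (rates in workflows/second)."""
    if arrival_rate >= processing_rate:
        raise ValueError("queue is unstable: arrival rate must be below processing rate")
    rho = arrival_rate / processing_rate          # utilization
    l_q = rho ** 2 / (1.0 - rho)                  # mean number waiting in the queue
    w = 1.0 / (processing_rate - arrival_rate)    # mean time in the system
    return {"utilization": rho, "mean_queue_length": l_q, "mean_time_in_system_s": w}

print(mm1_summary(arrival_rate=8.0, processing_rate=10.0))
```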
In the realm of high-energy physics research, the demand for computational power continues to increase, particularly in online applications such as the Event Filter. Innovations in performance enhancement are sought after, leading to exploration in integrating FPGA accelerators within existing software frameworks like Athena, extensively employed in the ATLAS experiment at CERN. This presentation delves into the intricacies of this integration, focusing on the system-level challenges posed by the simultaneous utilization of FPGA resources by multiple Athena algorithms in the heterogeneous computing environment explored for the TDAQ Phase II upgrade.
Central to this discussion is the notion of shared state management, particularly concerning the loading of FPGA bitstreams. As multiple algorithms contend for access to the same FPGA, efficient management of the FPGA's state becomes crucial to ensure optimal performance and resource utilization. This work addresses this challenge, presenting insights and strategies for orchestrating FPGA resource sharing within the Athena framework.
While still a work in progress, this contribution provides valuable insights into the ongoing efforts to seamlessly integrate FPGA accelerators into complex research environments, paving the way for enhanced computational capabilities.
This study explores possible enhancements in analysis speed, WAN bandwidth efficiency, and data storage management through an innovative data access strategy. The proposed model introduces specialized "delivery" services for data preprocessing, which include filtering and reformatting tasks executed on dedicated hardware located alongside the data repositories at the CERN Tier-0 or at Tier-1 or Tier-2 facilities. Positioned near the source storage, these services are crucial for limiting redundant data transfers and focus on sending only vital data to distant analysis sites, aiming to optimize network and storage use at those sites. Within the scope of the NSF-funded FABRIC Across Borders (FAB) initiative, we assess this model using an "in-network, edge" computing cluster at CERN, outfitted with substantial processing capabilities (CPU, GPU, and advanced network interfaces). This edge computing cluster features dedicated network peering arrangements that link CERN Tier-0, the FABRIC experimental network, and an analysis center at the University of Chicago, creating a solid foundation for our research.
Central to our infrastructure is ServiceX, an R&D software project under the Data Organization, Management, and Access (DOMA) group of the Institute for Research and Innovation in Software for High Energy Physics (IRIS-HEP). ServiceX is a scalable filtering and reformatting service, designed to operate within a Kubernetes environment and deliver output to an S3 object store at an analysis facility. Our study assesses the impact of server-side delivery services in augmenting the existing HEP computing model, particularly evaluating their possible integration within the broader WAN infrastructure. This model could empower Tier-1 and Tier-2 centers to become efficient data distribution nodes, enabling a more cost-effective way to disseminate data to analysis sites and object stores, thereby improving data access and efficiency. This research is experimental and serves as a demonstrator of the capabilities and improvements that such integrated computing models could offer in the HL-LHC era.
The ATLAS Metadata Interface (AMI) ecosystem has been developed within the context of ATLAS, one of the largest scientific collaborations. AMI is a mature, generic, metadata-oriented ecosystem that has been maintained for over 23 years. This paper briefly describes the main applications of the ecosystem within the experiment, including metadata aggregation for millions of datasets and billions of files, searching for datasets by metadata criteria, and metadata definition for data processing jobs (AMI-tags). The current architecture of the underlying databases will be outlined, in addition to the ongoing developments for preparations for Run 4. Optimizations based on advanced partitioning will also be described, enabling all datasets to be migrated into a new single catalog.
CERN IT has offered a Kubernetes service since 2016, expanding to incorporate multiple other technologies from the cloud native ecosystem over time. Currently the service runs over 500 clusters and thousands of nodes serving use cases from different sectors in the organization.
In 2021 the ATS sector showed interest in looking at a similar setup for their container orchestration effort. A collaboration was started with an initial proof of concept running the CERN IT service inside the control room datacenter, including use cases from multiple teams in the sector. Following a successful initiative that ran over a year, a second phase was launched to bring the service to production.
In this paper we describe the existing CERN IT service and the major changes and improvements that were required to serve accelerator control use cases. We highlight the changes due to running in an isolated, air-gapped network environment, as well as the additional integrations regarding identity, storage and datacenter infrastructure. Finally we detail results from an extensive effort for failure scenario evaluation to comply with the expected service levels, as well as plans for extending the existing infrastructure to new use cases.
To operate ATLAS ITk system tests and later the final detector, a graphical operation and configuration system is needed. For this a flexible and scalable framework based on distributed microservices has been introduced. Different microservices are responsible for configuration or operation of all parts of the readout chain.
The configuration database microservice provides the configuration files needed to configure the hardware components of the readout chain and perform scans using the DAQ software. It saves the connectivity information and configuration files for the operation of the system in so called runkeys. These runkeys are stored in a flexible, tree-based data structure. This flexible structure allows the storage of specialized runkeys made up of different objects for each of the ITk subdetectors within the same database.
It is investigated whether a single-instance database is sufficient to efficiently serve these files to the subdetectors or if a distributed system of local ConfigDB caches is needed. These caches would each provide only a subset of the runkeys depending on the elements of the readout chain that the specific cache serves.
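A schematic sketch of such a tree-based runkey structure is given below; the node names, payload fields, and example hierarchy are hypothetical and do not reflect the actual ConfigDB schema.

```python
# Schematic sketch only (not the actual ConfigDB schema): a runkey modelled as a
# tree of nodes, each carrying an object payload, so that different ITk
# subdetectors can store differently shaped runkeys in the same structure.
from dataclasses import dataclass, field

@dataclass
class RunkeyNode:
    name: str
    payload: dict = field(default_factory=dict)     # e.g. connectivity or config data
    children: list["RunkeyNode"] = field(default_factory=list)

    def add(self, child: "RunkeyNode") -> "RunkeyNode":
        self.children.append(child)
        return child

    def find(self, name: str):
        """Depth-first lookup of a node by name."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

# Hypothetical example: a strips runkey with one stave and two modules.
runkey = RunkeyNode("ITkStrips_runkey_v1")
stave = runkey.add(RunkeyNode("stave_07", {"felix_link": 12}))
stave.add(RunkeyNode("module_00", {"config_file": "module_00.json"}))
stave.add(RunkeyNode("module_01", {"config_file": "module_01.json"}))
print(runkey.find("module_01").payload)
```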
The ALICE Collaboration aims to precisely measure heavy-flavour (HF) hadron production in high-energy proton-proton and heavy-ion collisions, since it provides valuable tests of perturbative quantum chromodynamics models and insights into hadronization mechanisms. Measurements of Ξ$_c^+$ and Λ$_c^+$ production, with the baryons decaying into a proton (p) and charged π and K mesons, are remarkable examples of investigation in the HF sector. As in other ALICE analyses, a novel approach based on Boosted Decision Tree (BDT) classifiers has been adopted to discriminate the signal yields from the background processes. Especially for the Ξ$_c^+$ → pπK process, the Machine Learning (ML)-based approach is required and particularly challenging due to the large combinatorial background, small branching ratio, and short O(100 µm) decay length of the Ξ$_c^+$ baryon. FAIR, a European project synergic with the ALICE experiment, aims to set up an open-source, user-friendly, and interactive PyTorch-based environment external to the official ALICE framework to perform BDT-based multivariate analyses. The FAIR benchmark imports different ML packages (XGBoost, Sklearn and Ray) to prepare the data and configure the BDT models in Jupyter Notebooks. Currently, the training is performed on a preliminary dataset with limited statistics using a partitioned shared GPU available through an Apache Mesos cluster at the ReCaS-Bari datacenter. In the future, when a larger dataset becomes available, we intend to leverage a GPU-powered Kubernetes cluster for processing large-scale applications, including ML tool training. This contribution will present a performance comparison of the investigated BDT architectures trained with simulated signal events and background Run 3 data provided by ALICE.
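In the same spirit, a minimal XGBoost BDT for signal/background separation can be sketched as follows; the features and toy data are invented for illustration and do not correspond to the actual analysis configuration.

```python
# Illustrative only: a binary XGBoost BDT separating signal from combinatorial
# background, in the spirit of the analysis described above. Feature names and
# the toy data are hypothetical.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 20_000
# Toy features: decay length, candidate pT, proton PID score, impact parameter.
signal = rng.normal([0.03, 4.0, 0.8, 0.01], [0.01, 1.5, 0.1, 0.005], size=(n, 4))
background = rng.normal([0.01, 3.0, 0.5, 0.03], [0.01, 1.5, 0.2, 0.01], size=(n, 4))
X = np.vstack([signal, background])
y = np.concatenate([np.ones(n), np.zeros(n)])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
bdt = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
bdt.fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, bdt.predict_proba(X_test)[:, 1]))
```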
The OMS data warehouse (DWH) constitutes the foundation of the Online Monitoring System (OMS) architecture within the CMS experiment at CERN, responsible for the storage and manipulation of non-event data within Oracle databases. Leveraging PL/SQL code, the DWH orchestrates the aggregation and modification of data from several sources, inheriting and revamping code from the previous project known as Web Based Monitoring to meet evolving requirements. The main goals of the DWH restructuring were: the modernization of inherited PL/SQL code, necessitating the creation of new aggregation tables and the implementation of enhancements such as standardized naming conventions; improved development workflows; and continuous integration strategies. DWH is composed of multiple Oracle schemas and integrates external PL/SQL libraries, in particular the CERN Beams Common4Oracle library, which consolidates common functionalities from various CERN Beams department databases into a unified codebase for widespread application. This article delves into the architecture and development strategies employed within the OMS data warehouse, underscoring its role in facilitating efficient data aggregation and management within the OMS project in the CMS experiment at CERN.
The Super Tau-Charm Facility (STCF) is a new-generation $e^+e^-$ collider aimed at studying tau-charm physics. Particle identification (PID), one of the most fundamental tools for various physics studies at the STCF experiment, is crucial for achieving the physics goals of STCF. In recent decades, machine learning (ML) has emerged as a powerful alternative for particle identification in HEP experiments. ML algorithms, such as neural networks and boosted decision trees, have shown superior performance in handling complex and multi-dimensional data, making them well suited for integrating particle identification information from multiple sub-detector systems. In this work, we present powerful PID software based on ML techniques, including a global PID algorithm for charged particles that combines information from all sub-detectors, as well as a deep CNN discriminating neutral particles based on calorimeter responses. Preliminary results show that the PID models achieve excellent performance, greatly boosting the physics potential of STCF.
The PATOF project builds on work at the MAMI particle physics experiment A4. A4 produced a stream of valuable data over many years that has already yielded high-quality scientific output and still provides a solid basis for future publications. The A4 data set consists of 100 TB and 300 million files of different types, with only vague context because of the hierarchical folder structure and file formats providing minimal metadata.
In PATOF we would like to build a “FAIR Metadata Factory”, i.e. a process to create a naturally evolved metadata schema that can be used across research fields. The first focus will be on creating machine-readable XML files containing metadata from the logbook and other sources and on further enriching them.
In PATOF, we intend to conclude the work on A4 data, to extract the lessons learned there in the form of a cookbook that can capture the methodology for making individual experiment-specific metadata schemas FAIR, and to apply it to four other experiments: The ALPS II axion and dark matter search experiment at DESY. The PRIMA experiment at MAMI in Mainz for measuring the pion transition form factor. The upcoming nuclear physics experiment P2 at MESA in Mainz. Finally, the LUXE experiment at DESY planned to start in 2026. The focus of PATOF is on making these data fully publicly available.
The objectives of the project are i) a FAIR Metadata Factory (i.e. a cookbook of (meta)data management recommendations), and ii) the FAIRification of data from concrete experiments. Both aspects are inherently open in nature so that everybody can profit from PATOF results. The cookbook is expected to be further enhanced with contributions from other experiments even after PATOF (“living cookbook”).
Developments in microprocessor technology have confirmed the trend towards higher core counts and a decreased amount of memory per core, resulting in major improvements in power efficiency for a given level of performance. Core counts have increased significantly over the past five years for the x86_64 architecture, which dominates the LHC computing environment, and the higher core density is not only a feature of large HPC systems but is also readily available on the commodity hardware preferentially used at Grid sites. The baseline multi-core workloads are, however, still largely based on 8 cores, and the jobs are sized accordingly in terms of the number of events processed. The new multi-threaded AthenaMT framework has been introduced for ATLAS data processing and simulation in Run 3 in order to address the performance limitations of the classic single-threaded Athena when run in parallel in multi-core jobs. In this work, the performance of some ATLAS workloads is investigated when scaling core counts up to a whole node where possible and at different job sizes, with the aim of providing input to software developers.
CMS has deployed a number of different GPU algorithms at the High-Level Trigger (HLT) in Run 3. As the code base for GPU algorithms continues to grow, the burden for developing and maintaining separate implementations for GPU and CPU becomes increasingly challenging. To mitigate this, CMS has adopted the Alpaka (Abstraction Library for Parallel Kernel Acceleration) library as the performance portability solution to provide a single-code base for parallel execution on both GPUs and CPUs in CMS software (CMSSW).
A direct CUDA version of the HCAL energy reconstruction, called Minimization At Hcal, Iteratively (MAHI), was deployed at the HLT in the 2022-2023 data-taking period. This contribution will describe how the CUDA version was converted into a portable implementation using the Alpaka library. We will discuss the porting experience from CUDA to Alpaka, the validation process, and the performance of the Alpaka version on CPU and GPU.
Efficient, ideally fully automated, software package building is essential in the computing supply chain of the CERN experiments. With Koji, a very popular software package building system used in the upstream Enterprise Linux communities, CERN IT provides a service to build software and images for the Linux OSes we support. Due to the criticality of the service and the limitations in Koji's built-in monitoring, the CERN Linux team implemented new functionality to allow integration with Prometheus, an open-source monitoring system and time-series database. This contribution will give an overview of Koji and its integration with Prometheus and Grafana, explain the challenges we tackled during the development of the integration, and show how we are benefiting from these new metrics to improve the quality of the service.
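The general exporter pattern can be sketched with the official Prometheus Python client as below; this is not the actual Koji integration, and the metric names and the stubbed collection function are hypothetical.

```python
# Sketch of the general pattern, not the actual Koji integration: expose a few
# build-system gauges over HTTP so that Prometheus can scrape them. The metric
# names and the stubbed collection function are hypothetical.
import random
import time

from prometheus_client import Gauge, start_http_server

builds_in_progress = Gauge("koji_builds_in_progress", "Builds currently running")
builders_online = Gauge("koji_builders_online", "Number of enabled build hosts")

def collect_stats() -> tuple[int, int]:
    """Stub: a real exporter would query the Koji hub here."""
    return random.randint(0, 50), random.randint(10, 20)

if __name__ == "__main__":
    start_http_server(9100)          # Prometheus scrapes http://host:9100/metrics
    while True:
        running, online = collect_stats()
        builds_in_progress.set(running)
        builders_online.set(online)
        time.sleep(30)
```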
TechWeekStorage24 was introduced by CERN IT Storage and Data Management group as a new “Center of Excellence” community networking format: a co-located series of events on Open Source Data Technologies, bringing together a wide range of communities, far beyond High Energy Physics and highlighting the wider technology impact of IT solutions born in HEP.
Combining the annual CS3 conference, the CERN Storage Day, and the EOS and CTA Workshops created a week-long opportunity for connection, collaboration, and discussion on storage services, open-source software-defined storage and data management technology, data policies & trends, innovative applications, collaboration platforms, digital sovereignty, FAIR and Open Science, security and privacy of data, and more.
This new event format is also environmentally more sustainable: participants from locations such as Brazil, US, China, Japan, Korea had an opportunity to attend multiple related events within a single trip.
https://techweekstorage.web.cern.ch
Over time, the idea of exploiting volunteer computing resources as additional capacity for experiments at the LHC has given rise to individual initiatives such as the CMS@Home project. With a starting point of R&D prototypes and projects such as "jobs in the Vacuum" and SETI@Home, the experiments have tried integrating these resources into their data production frameworks transparently to the computing infrastructure. Many of these efforts were subsequently rolled into the umbrella LHC@Home project. The use of virtual machines instantiated on volunteer resources, with images created and managed by the experiment according to its needs, provided the opportunity to implement this integration, and virtualization enabled CMS code from a Linux environment to also run on Windows and Macintosh systems, realizing a distributed and heterogeneous computing environment. A prototype of CMS@Home integrated with the CMS workload management CRAB3 was proposed in 2015, demonstrating the possibility of using BOINC as "manager" of volunteer resources and adapting the "vacuum" concept with the HTCondor Glidein system to get CMS pilots and jobs to execute on volunteers' computers. Since then, the integration of volunteer machines with the CMS workload management WMAgent, the official service dedicated to data production, has been seriously considered. The characteristics of volunteer resources regarding bandwidth capacity, connection behavior, and CPU and RAM capacities make them suitable for low-priority workflows with low I/O demands. The poster describes how the configuration of volunteer resources has evolved to keep pace with the development of the CMS computing infrastructure, including using tokens for resource authentication, exploiting regular expressions to accept workflows, manual glideins to initiate pilots, and other implementation details to achieve successful workflows. Currently, volunteers are able to execute task chains, including multicore jobs, and, despite their limitations, are contributing to CMS computing capacity with around 600 cores daily.
With an electron-positron collider operating at a center-of-mass energy of 2–7 GeV and a peak luminosity above 0.5 × 10^35 cm^−2 s^−1, the STCF physics program will provide a unique platform for in-depth studies of hadron structure and the non-perturbative strong interaction, as well as for probing physics beyond the Standard Model in the τ-charm sector, succeeding the present Beijing Electron-Positron Collider II (BEPCII). To fulfill the physics targets and to further maximize the physics potential of the STCF, not only the particles that decay immediately upon production but also long-lived particles, e.g. the lambda baryon, which may decay within or outside the inner tracker and hence leave a very limited number of hits in the inner tracker, should be reconstructed with good efficiency.
A Common Tracking Software (ACTS) provides a set of performant track reconstruction tools that are agnostic to the details of the detection technologies and magnetic field configuration. Due to its excellent performance, ACTS has been used as a tracking toolkit by various experiments such as ATLAS, sPHENIX, and FASER. Preliminary results of using the ACTS seeding and Combinatorial Kalman Filter (CKF) algorithms for STCF have been obtained. However, the tracking performance of ACTS seeding for long-lived particles at STCF is found to be far from satisfactory, owing to the fact that the STCF inner tracker has only three layers. Therefore, improving the tracking performance of ACTS for long-lived particles at STCF by combining the global track finding algorithm Hough Transform and the local track following algorithm CKF has been investigated.
In this talk, we will present the tracking performance of ACTS for STCF, which has a tracking system with a three-layer inner tracker and a drift chamber. Improvement of the tracking performance for long-lived particles at STCF using a combined global Hough Transform and the Combinatorial Kalman Filter will be highlighted.
LUX-ZEPLIN (LZ) is a dark matter direct detection experiment. Employing a dual-phase xenon time projection chamber, the LZ experiment set a world-leading limit for spin-independent scattering at 36 GeV/c^2 in 2022, rejecting cross sections above 9.2×10^−48 cm^2 at the 90% confidence level. Unsupervised machine learning methods are indispensable tools for working with big data and have been applied at various stages of LZ analysis for data exploration and anomaly detection. In this work, we discuss an unsupervised dimensionality reduction approach applied to a combination of both PMT waveforms and reconstructed features, aiming to identify anomalous events. We examine the tradeoffs in this method and compare our results to known anomalies in the data, as well as to conventional data quality cuts.
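A schematic sketch of such a dimensionality-reduction workflow is shown below using PCA on synthetic data; it is not the LZ pipeline, and the input shapes and outlier criterion are invented for illustration.

```python
# Schematic sketch, not the LZ analysis: concatenate PMT-waveform samples with
# reconstructed features, reduce the dimensionality with PCA, and flag events
# that sit far from the bulk in the reduced space. All data here are synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
waveforms = rng.normal(size=(5000, 200))        # toy summed-waveform samples
features = rng.normal(size=(5000, 10))          # toy reconstructed quantities
X = StandardScaler().fit_transform(np.hstack([waveforms, features]))

pca = PCA(n_components=5)
latent = pca.fit_transform(X)

# Simple outlier score: Euclidean distance from the centroid of the latent space.
scores = np.linalg.norm(latent - latent.mean(axis=0), axis=1)
anomalous = np.argsort(scores)[-20:]            # the 20 most unusual events
print("candidate anomalies:", anomalous)
```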
The HEP-RC group at UVic used Dynafed intensively to create federated storage clusters for Belle II and ATLAS, which were used by worker nodes deployed on clouds around the world. Since the end of DPM development also means the end of development for Dynafed, XRootD was tested with S3 as a backend to replace Dynafed. We will show similarities as well as major differences between the two systems, together with results of tests run on both for data transfers, checksum calculations, and clustering of different endpoints. This may help others to make efficient use of S3 storage as a WLCG site SE.
Large Language Models (LLMs) are undergoing a period of rapid updates and changes, with state-of-the-art models frequently being replaced. When applying LLMs to a specific scientific field, it is challenging to acquire unique domain knowledge while keeping the model itself advanced. To address this challenge, a sophisticated large language model system named Xiwu has been developed, allowing the most advanced foundation models to be switched flexibly and quickly. In this talk, we will discuss one of the best practices of applying LLMs in HEP, including seed-fission tools that can collect and clean HEP datasets quickly, a just-in-time learning system based on vector-store technology, and an on-the-fly fine-tuning system. The results show that Xiwu can smoothly switch between different models such as LLaMA, Vicuna, ChatGLM and Grok-1, and that the trained Xiwu model significantly outperforms the benchmark model on HEP knowledge in question answering and code generation.
Data and Metadata Organization, Management and Access
The dCache project provides open-source software deployed internationally to satisfy ever-more demanding storage requirements. Its multifaceted approach provides an integrated way of supporting different use cases with the same storage, from high-throughput data ingest and data sharing over wide area networks to efficient access from HPC clusters and long-term data persistence on tertiary storage. Although dCache was originally developed for HEP experiments, today it is used by various scientific communities, including astrophysics, biomedicine, and life science, each with its specific requirements. To match the requirements of these new communities and keep up with the scaling demands of existing experiments, dCache evolution is a permanent, ongoing process. With this contribution, we would like to highlight the recent developments in dCache regarding integration with the CERN Tape Archive (CTA), advanced metadata handling, token-based authorization support, a bulk API for QoS transitions, a REST API to control interaction with the tape system, and future development directions.
After the deprecation of the open-source Globus Toolkit used for GridFTP transfers, the WLCG community has shifted its focus to the HTTP protocol. The WebDAV protocol extends HTTP to create, move, copy and delete resources on web servers. StoRM WebDAV provides data storage access and management through the WebDAV protocol over a POSIX file system. Mainly designed to be used by the WLCG community, StoRM WebDAV supports authentication through X.509 certificates, VOMS proxies and JWT tokens. Moreover, Third-Party Copies (an extension of the WebDAV COPY verb to support copies between data centers) are supported.
With the aim of improving data transfer performance, this contribution describes the changes made to StoRM WebDAV in order to delegate file transfers to the external reverse proxy NGINX, decoupling them from the internal Java implementation. To further simplify the StoRM WebDAV codebase, the validation of VOMS proxies and JWT tokens is also delegated to NGINX, augmented with specific modules that we developed. Even with this solution, authorization is still enforced by StoRM WebDAV.
Following the effort of the WLCG community to have better metrics about data flows, this contribution also describes the work done in order to support SciTags, an initiative promoting identification of the science domains and their high-level activities at the network level.
Managing the data deluge generated by large-scale scientific collaborations is a challenge. The Rucio Data Management platform is an open-source framework engineered to orchestrate the storage, distribution, and management of massive data volumes across a globally distributed computing infrastructure. Rucio meets the requirements of high-energy physics, astrophysics, genomics, and beyond, pioneering new ways to facilitate research at the exabyte-scale.
This presentation introduces Rucio, highlighting its key features and strategic roadmap that underscore its flexibility towards diverse scientific domains, deep diving into concrete operational experience from various EU projects (ESCAPE, DaFab, InterTwin).
A special emphasis will be placed on the contributions of the CERN IT department, whose active engagement with the Rucio project has increased recently and catalysed significant contributions to the core software. This collaboration has not only enhanced Rucio’s capabilities but also solidified its role in LHC experiments such as ATLAS and CMS, and provided a path forward for SMEs (Small and Medium experiments) to benefit from a converged data management platform.
The data movement manager (DMM) is a prototype interface between the CERN-developed data management software Rucio and the software-defined networking (SDN) service SENSE by ESnet. It allows for SDN-enabled high energy physics data flows using the existing Worldwide LHC Computing Grid infrastructure. In addition to the key feature of DMM, namely transfer-priority-based bandwidth allocation for optimal network usage, it also allows for the identification of the exact cause of underperforming flows through end-to-end monitoring of the data flows, having access to host (network interface) level throughput metrics and transfer-tool (FTS) data-transfer job level metrics. This paper describes the design and implementation of DMM.
The Large Hadron Collider (LHC) experiments rely heavily on the XRootD software suite for data transfer and streaming across the Worldwide LHC Computing Grid (WLCG) both within sites (LAN) and across sites (WAN). While XRootD offers extensive monitoring data, there's no single, unified monitoring tool for all experiments. This becomes increasingly critical as network usage grows, and with the High-Luminosity LHC (HL-LHC) demanding even higher bandwidths.
The "Shoveler" system addresses this challenge by providing a platform to collect and visualize XRootD traffic data from all four LHC experiments, separated by type, direction and locality of the traffic. This contribution explores the Shoveler plus Collector architecture, its current deployment status at WLCG sites, and validates its collected information by comparing it with data from individual experiment monitoring frameworks.
The WLCG community, with the main LHC experiments at the forefront, is moving away from X.509 certificates, replacing the Authentication and Authorization layer with OAuth2 tokens. FTS, as a middleware and core component of the WLCG, plays a crucial role in the transition from X.509 proxy certificates to tokens. The paper will present in detail the FTS token design and how this will serve the needs of the community, WLCG and non-WLCG alike. Finally, a chapter will also report on performance measurements and lessons learned during Data Challenge 2024.
Online and real-time computing
A new algorithm, called "Downstream", has been developed and implemented at LHCb, which is able to reconstruct and select very displaced vertices in real time at the first level of the trigger (HLT1). It makes use of the Upstream Tracker (UT) and the Scintillating Fibre detector (SciFi) of LHCb and is executed on GPUs inside the Allen framework. In addition to an optimized strategy, it utilizes a Neural Network (NN) implementation to increase the track efficiency and reduce the ghost rate, with very high throughput and a limited time budget. Besides serving to reconstruct Ks and Lambda vertices to calibrate and align the detectors, the Downstream algorithm and the associated two-track vertexing will largely increase the LHCb physics potential for detecting long-lived particles during Run 3.
The event reconstruction in the CBM experiment is challenging.
There will be no simple hardware trigger due to the novel concepts of free-streaming data and self-triggered front-end electronics.
Thus, there is no a priori association of signals to physical events.
CBM will operate at interaction rates of 10 MHz, unprecedented for heavy ion experiments.
At this rate, collisions overlap in time and are to be resolved in software by reconstruction algorithms.
These complications make the speed and quality of the data reconstruction crucial.
The core of the track reconstruction is the Cellular Automaton (CA) based algorithm used for the Silicon Tracking System (STS).
It digests free-streaming data both online and offline, taking large time slices of the hit measurements as input, with no a priori definition of the physical collisions.
The data is reconstructed in time portions by applying a non-merging sliding-window algorithm, which achieves almost constant time per event regardless of the time slice size.
The algorithm was successfully applied to run online for the mini-CBM experiment during the March 2024 data-taking campaign.
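As a toy illustration of splitting a time slice into independently processed portions (not the CBM CA implementation), one simple strategy is to cut a new window whenever a quiet gap appears in the time-ordered hits:

```python
# Toy illustration of the idea only (not the CBM CA implementation): walk through
# time-ordered hits and cut a new processing window whenever a quiet gap larger
# than `gap_ns` is found, so each portion can be reconstructed independently
# without merging results across windows. Threshold values are hypothetical.
def time_windows(hit_times_ns, gap_ns=50.0):
    """Split a sorted list of hit times into windows separated by quiet gaps."""
    windows, current = [], [hit_times_ns[0]]
    for t in hit_times_ns[1:]:
        if t - current[-1] > gap_ns:      # quiet gap: close the current window
            windows.append(current)
            current = [t]
        else:
            current.append(t)
    windows.append(current)
    return windows

hits = [10, 12, 15, 300, 302, 305, 900, 903]   # toy hit times in ns
for i, w in enumerate(time_windows(hits)):
    print(f"window {i}: {len(w)} hits spanning {w[0]}-{w[-1]} ns")
```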
In this presentation, we introduce BuSca, a prototype algorithm designed for real-time particle searches, leveraging the enhanced parallelization capabilities of the new LHCb trigger scheme implemented on GPUs. BuSca is focused on downstream reconstructed tracks, detected exclusively by the UT and SciFi detectors. By projecting physics candidates onto 2D histograms of flight distance and mass hypotheses at a remarkable 30 MHz rate, BuSca identifies hot spots indicative of potential candidates of new particles, thereby providing strategic guidance for the development of new trigger lines. Additionally, BuSca offers an Armenteros-Podolanski representation, providing insights into the mass hypotheses of the decay products associated with the new particle. The performance of BuSca, including the outcomes of its initial prototype on simulated data, will be elucidated in this talk.
Online reconstruction is key for monitoring purposes and real time analysis in High Energy and Nuclear Physics (HEP) experiments. A necessary component of reconstruction algorithms is particle identification (PID) that combines information left by a particle passing through several detector components to identify the particle’s type. Of particular interest to electro-production Nuclear Physics experiments such as CLAS12 is electron identification which is used to trigger data recording. A machine-learning approach was developed for CLAS12 to reconstruct and identify electrons by combining raw signals at the data acquisition level from several detector components. This approach achieves a high electron identification purity whilst retaining a 99.95% efficiency. The machine learning tools are capable of running at high rates exceeding the data acquisition rates and will allow electron reconstruction in real-time. This framework can then be expanded to other particle types. This work enhances online analyses and monitoring at CLAS12. Improved electron identification in the trigger also contributes to the reduction in recorded data volumes and improves data processing times. This approach to triggering will be employed when transitioning to higher luminosity experiments at CLAS12 where the data volume will increase significantly.
Ahead of Run 3 of the LHC, the trigger of the LHCb experiment was redesigned. The L0 hardware stage present in Runs 1 and 2 was removed, with detector readout at 30 MHz passing directly into the first stage of the software-based High Level Trigger (HLT), run on GPUs. Additionally, the second stage of the upgraded HLT makes extensive use of the Turbo event model, wherein only those candidates required for a trigger decision are saved. As the LHCb detector records only events selected by the trigger system, an absolute trigger efficiency cannot be evaluated. The TISTOS method provides a solution to this by evaluating the signal trigger efficiency on a trigger-selected sub-sample independent of signal. Events can be classified as having triggered on signal (TOS), triggered independent of signal (TIS), or both (TISTOS). Efficiencies are then calculated by a tag-and-probe approach, in which TIS and TISTOS events are used as tag and probe, respectively. This approach was applied successfully in Runs 1 and 2; however, in saving only candidates required for trigger decision, all such candidates are TOS by default. The TISTOS method has thus been specified in terms of the stage of selection below each stage of interest to define meaningful efficiencies. This contribution presents the development and performance of the TISTOS method for the upgraded trigger and event model, and an overview of the HLT trigger efficiencies evaluated in 2024 LHCb proton-proton collision data.
The ever-growing amounts of data produced by high energy physics experiments create a need for fast and efficient track reconstruction algorithms. When storing all incoming information is not feasible, online algorithms need to provide reconstruction quality similar to their offline counterparts. To achieve this, novel techniques need to be introduced that exploit the acceleration offered by highly parallel hardware platforms such as GPUs. Artificial Neural Networks are a natural candidate here, thanks to their good pattern recognition abilities, non-iterative execution, and easy implementation on hardware accelerators.
The MUonE experiment, which searches for signs of New Physics in the anomalous magnetic moment of the muon, is investigating the use of machine learning techniques in its data processing. Work related to ML-based track reconstruction will be presented. The first attempt used a deep multilayer perceptron to predict the parameters of the tracks in the detector. The neural network formed the basis of an algorithm that proved to be as accurate as the classical approach while replacing the tedious step of iterative CPU-based pattern recognition. Further work included the implementation of a Graph Neural Network for the classification of track segment candidates.
Offline Computing
Developments of the new Level-1 Trigger at CMS for the High-Luminosity Operation of the LHC are in full swing. The Global Trigger, the final stage of this new Level-1 Trigger pipeline, is foreseen to evaluate a menu of over 1000 cut-based algorithms, each targeting a specific physics signature or acceptance region. Automating the task of tailoring individual algorithms to specific physics regions would be a significant time saver while ensuring the flexibility to adapt swiftly to evolving run conditions. This task essentially resembles a multi-objective optimization problem, where the goal is to strike a balance between the trigger rate and the trigger efficiency for the desired physics region.
We present the idea of leveraging achievement scalarization, a technique to turn the two objective functions into a scalar function with a minimum closest to a reference point chosen by a decision maker. An iterative gradient descent approach can then be employed to minimize this function, each iteration slightly modifying the cut parameters in the direction of descent. The decision maker in this context can be a single person designing parts of the menu or a collective group like CERN's data performance group agreeing on specific goals for upcoming data-taking sessions.
Preliminary results of using this procedure in targeting B meson decays have demonstrated promising outcomes. Ongoing efforts involve exploring alternative minimization techniques like evolutionary algorithms and extending the method to other physics signatures.
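The scalarization-plus-descent idea can be illustrated with a short sketch. The weighted Chebyshev form of the achievement scalarizing function, the numerical gradient, and the learning rate are assumptions for illustration; the actual Global Trigger optimization and its objective evaluation are not shown here.

```python
# Minimal sketch (assumed, not the CMS implementation): achievement scalarization
# of the two trigger objectives followed by gradient descent on the cut parameters.
import numpy as np

def scalarize(objectives, reference, weights):
    """Weighted Chebyshev achievement scalarization: smaller is better."""
    return np.max(weights * (objectives - reference))

def optimize_cuts(cuts, eval_menu, reference, weights, lr=0.01, eps=1e-3, n_iter=200):
    """eval_menu(cuts) -> array [rate, 1 - efficiency]; both are to be minimized.
    The reference point encodes the decision maker's target (rate*, 1 - efficiency*)."""
    cuts = np.asarray(cuts, dtype=float)
    for _ in range(n_iter):
        f0 = scalarize(eval_menu(cuts), reference, weights)
        grad = np.zeros_like(cuts)
        for k in range(len(cuts)):               # numerical gradient per cut parameter
            shifted = cuts.copy()
            shifted[k] += eps
            grad[k] = (scalarize(eval_menu(shifted), reference, weights) - f0) / eps
        cuts -= lr * grad                        # small step in the direction of descent
    return cuts
```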
Searching for anomalous data is especially important in rare event searches like the LUX-ZEPLIN (LZ) experiment's hunt for dark matter. While LZ's data processing provides analyzer-friendly features for all data, searching for anomalous data after minimal reconstruction makes it possible to find anomalies that may not be captured by reconstructed features, while avoiding any reconstruction errors. Autoencoders can be used to probe for anomalous PMT waveforms resulting from ionization signals (S2) and have found unresolved S2s resulting from multiple scatter interactions. In addition to comparing results to waveform-shape template-fitting methods, these techniques can be extended by applying them to PMT waveforms from prompt scintillation light (S1) and to S2 heatmaps which capture positional information. Results from such methods are discussed and compared to known anomalies.
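The autoencoder approach can be sketched compactly: train on nominal waveforms and use the reconstruction error as an anomaly score. The architecture, waveform length, and scoring below are assumptions for illustration, not the LZ code.

```python
# Minimal sketch (assumed): a dense autoencoder for PMT waveforms; waveforms that
# reconstruct poorly (large error) are flagged as anomalous.
import torch
import torch.nn as nn

N_SAMPLES = 256  # assumed number of digitized samples per waveform

class WaveformAE(nn.Module):
    def __init__(self, latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(N_SAMPLES, 64), nn.ReLU(),
                                 nn.Linear(64, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(),
                                 nn.Linear(64, N_SAMPLES))

    def forward(self, x):
        return self.dec(self.enc(x))

def anomaly_score(model, waveforms):
    """Mean squared reconstruction error per waveform; large values flag anomalies."""
    with torch.no_grad():
        recon = model(waveforms)
    return ((waveforms - recon) ** 2).mean(dim=1)

model = WaveformAE()                                   # would be trained on nominal S2s
scores = anomaly_score(model, torch.randn(8, N_SAMPLES))  # toy input batch
```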
Detecting Gamma-Ray Burst (GRB) signals from triggerless data poses significant challenges due to high noise levels, a problem similarly encountered in the Large High Altitude Air Shower Observatory's Water Cherenkov Detector Array (LHAASO-WCDA) triggerless data analysis. This research aims to enhance the GRB triggerless data algorithm which leverages the distinct spatial properties of gamma-ray showers. By incorporating advanced machine learning techniques such as Bayesian optimization, we refine the algorithm to more effectively detect GRB signals within noisy background signals. Preliminary findings indicate a marked improvement in the detection of GRB events, suggesting that machine learning methods can substantially enhance existing astrophysical data analysis techniques. These methods could lead to more accurate and reliable identification of GRB signals, thereby contributing to our understanding of these cosmic phenomena.
The BESIII experiment at the BEPCII electron-positron collider, located at IHEP, Beijing, China, studies hadron physics and $\tau$-charm physics with the highest accuracy achieved to date. It has collected several of the world's largest $e^+e^-$ samples in the $\tau$-charm region. Anomaly detection in the BESIII detectors is an important part of improving data quality, enhancing data-acquisition efficiency and monitoring detector status. An offline, unsupervised, autoencoder-based anomaly detection method is applied to the CsI(Tl) electromagnetic calorimeter (EMC). The method checks the histograms generated by each crystal, using the Jensen-Shannon distance as the loss function. Compared to the traditional method, this approach provides more accurate anomaly information with less manpower.
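The Jensen-Shannon distance used above is easy to compute for binned data. The sketch below compares a per-crystal histogram against a reference histogram with shared binning; using a reference histogram in this way is an assumption for illustration, not a description of the BESIII workflow.

```python
# Minimal sketch (assumed): Jensen-Shannon distance between two histograms,
# usable as an anomaly score for a crystal's response distribution.
import numpy as np

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance (square root of the JS divergence, base-2 logs)."""
    p = np.asarray(p, float); p = p / (p.sum() + eps)
    q = np.asarray(q, float); q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

edges = np.linspace(-5, 5, 51)                                     # shared binning
reference = np.histogram(np.random.normal(0.0, 1, 100_000), bins=edges)[0]
crystal   = np.histogram(np.random.normal(0.3, 1, 100_000), bins=edges)[0]  # shifted response
print(js_distance(crystal, reference))   # larger distance -> more anomalous crystal
```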
During the LHC High-Luminosity phase, the LHCb RICH detector will face challenges due to increased particle multiplicity and high occupancy. Introducing sub-100ps time information becomes crucial for maintaining excellent particle identification (PID) performance. The LHCb RICH collaboration plans to anticipate the introduction of timing through an enhancement program during the third LHC Long Shutdown. In the RICH detector, Cherenkov photons from a track arrive nearly simultaneously at the detector plane, allowing precise hit time prediction. The RICH reconstruction algorithm computes track and photon time-of-flight and estimates where photons are expected on the photodetector plane. Determining the primary vertex time (PV T$_0$) is crucial in predicting the time of arrival of photons on the photodetector plane. Adding time information allows the application of a software time gate around the predicted time per track to enhance the signal-to-background ratio and PID performance. This contribution describes how to estimate the PV T$_0$ using RICH information only, a novel approach for LHCb. The proposed algorithm computes a reconstructed PV time for every photon from hit time and tracking information. The PV T$_0$ is extracted by averaging this reconstructed time for all photons belonging to the PV. The challenge lies in correctly associating photons with their PV, which is a two-step process: PV-track and track-photon associations, both presenting inefficiencies. Results compare the estimated PV time resolution with Monte Carlo simulations. This contribution aims to describe the integration of fast timing in the RICH detector, illustrating the impact of the PV time estimation method on PID performance.
The upcoming upgrades of LHC experiments and next-generation FCC (Future Circular Collider) machines will again change the definition of big data for the HEP environment. The ability to effectively analyse and interpret complex, interconnected data structures will be vital. This presentation will delve into the innovative realm of Graph Neural Networks (GNNs). This powerful tool extends traditional deep learning techniques to handle graph-structured data and may provide new and fast algorithms for track reconstruction in both the 3D and 4D domains.
Projecting the challenging task of track reconstruction, especially demanding in a harsh hadronic environment, into the non-Euclidean domain of GNNs may leverage the intrinsic structure of graph data to extract additional crucial features and patterns that are either difficult or impossible to obtain with traditional statistical or intelligent reconstruction algorithms.
We present our initial studies using various GNN models implemented within the ACTS (A Common Tracking Software) framework. In our studies, we created a telescope detector that resembles the LHCb silicon vertex locator and used toy-generated data with truth information. Using this simulated setup, we were able to successfully train several GNN models to perform track reconstruction tasks. Based on these initial results, we performed preliminary studies to obtain efficiencies and resolutions for selected kinematic variables.
Our preliminary studies are very promising and show significant potential for using GNN models as track reconstruction engines for future LHC upgrades and beyond.
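A common way to phrase GNN-based tracking is edge classification: score candidate hit-to-hit connections and keep the high-scoring ones as track segments. The sketch below is one such formulation in plain PyTorch under assumed dimensions; it is not the models presented above, which were implemented within ACTS.

```python
# Minimal sketch (assumed): one message-passing step followed by an edge classifier
# that scores which candidate hit-to-hit connections belong to real tracks.
import torch
import torch.nn as nn

class TinyEdgeGNN(nn.Module):
    def __init__(self, node_dim=3, hidden=32):
        super().__init__()
        self.node_mlp = nn.Sequential(nn.Linear(2 * node_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, node_dim))
        self.edge_mlp = nn.Sequential(nn.Linear(2 * node_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 1))

    def forward(self, x, edge_index):
        src, dst = edge_index                       # edges as (source, target) hit indices
        # message passing: aggregate neighbour information onto each hit
        msgs = self.node_mlp(torch.cat([x[src], x[dst]], dim=1))
        agg = torch.zeros_like(x).index_add_(0, dst, msgs)
        h = x + agg                                 # residual node update
        # edge classification: probability that the edge is part of a true track
        return torch.sigmoid(self.edge_mlp(torch.cat([h[src], h[dst]], dim=1))).squeeze(1)

hits = torch.randn(100, 3)                          # toy 3D hit positions
edges = torch.randint(0, 100, (2, 500))             # toy candidate edges
scores = TinyEdgeGNN()(hits, edges)                 # one score per candidate edge
```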
Distributed Computing
Since 2017, the Worldwide LHC Computing Grid (WLCG) has been working towards enabling token-based authentication and authorization throughout its entire middleware stack.
Taking guidance from the WLCG Token Transition Timeline, published in 2022, substantial progress has been achieved not only in making middleware compatible with the use of tokens, but also in understanding the limitations of the WLCG Common JWT Profiles, first published in 2019. Significant scalability experience has been gained from Data Challenge 2024, during which millions of files were transferred with only tokens used as credentials.
Besides describing the state of affairs in the transition to tokens, revisions to the WLCG token profile, and the evolving roadmaps, this contribution also covers the corresponding transition from VOMS-Admin to INDIGO-IAM services, with continuing improvements in terms of functionality as well as deployment.
Created in 2023, the Token Trust and Traceability Working Group (TTT) was formed in order to answer questions of policy and best practice with the ongoing move from X.509 and VOMS proxy certificates to token-based solutions as the primary authorisation and authentication method in grid environments. With a remit to act in an investigatory and advisory capacity alongside other working groups in the token space, the TTT is composed of a broad variety of stakeholders to provide a breadth of experience and viewpoints. The requirements of grid sites, users, identity providers and virtual organisations to be able to trace workflows remain largely the same in a token paradigm as when using X.509 certificates, while tokens provide a new set of challenges, requiring a rethink and restructure of the policies and processes that were defined with just X.509 and VOMS in mind.
After providing an overview of the current status of the token trust landscape we will detail the initial findings, future plans and recommendations to be made by the TTT. This will include best practice for sites and identity providers, suggestions for token development, and methodologies for tracing token usage by system administrators within common grid middleware stacks.
Within the LHC community, a momentous transition has been occurring in authorization. For nearly 20 years, services within the Worldwide LHC Computing Grid (WLCG) have authorized based on mapping an identity, derived from an X.509 credential, or a group/role derived from a VOMS extension issued by the experiment. A fundamental shift is occurring to capabilities: the credential, a bearer token, asserts the authorizations of the bearer, not the identity.
By the HL-LHC era, the CMS experiment plans for the transition to tokens, based on the WLCG Common JSON Web Token profile, to be complete. Services in the technology architecture include the INDIGO Identity and Access Management server to issue tokens; a HashiCorp Vault server to store and refresh access tokens for users and jobs; a managed token bastion server to push credentials to the HTCondor CredMon service; and HTCondor to maintain valid tokens in long-running batch jobs. We will describe the transition plans of the experiment, current status, configuration of the central authorization server, lessons learned in commissioning token-based access with sites, and operational experience using tokens for both job submissions and file transfers.
Fermilab is the first High Energy Physics institution to transition from X.509 user certificates to authentication tokens in production systems. All of the experiments that Fermilab hosts are now using JSON Web Token (JWT) access tokens in their grid jobs. Many software components have been either updated or created for this transition, and most of the software is available to others as open source. The tokens are defined using the WLCG Common JWT Profile. Token attributes for all the tokens are stored in the Fermilab FERRY system which generates the configuration for the CILogon token issuer. High security-value refresh tokens are stored in Hashicorp Vault configured by htvault-config, and JWT access tokens are requested by the htgettoken client through its integration with HTCondor. The Fermilab job submission system jobsub was redesigned to be a lightweight wrapper around HTCondor. For automated job submissions a managed tokens service was created to reduce duplication of effort and knowledge of how to securely keep tokens active. The existing Fermilab file transfer tool ifdh was updated to work seamlessly with tokens, as well as the Fermilab POMS (Production Operations Management System) which is used to manage automatic job submission and the RCDS (Rapid Code Distribution System) which is used to distribute analysis code via the CernVM FileSystem. The dCache storage system was reconfigured to accept tokens for authentication in place of X.509 proxy certificates. As some services and sites have not yet implemented token support, proxy certificates are still sent with jobs for backwards compatibility but some experiments are beginning to transition to stop using them. There have been some glitches and learning curve issues but in general the system has been performing well and is being improved as operational problems are addressed.
INDIGO IAM (Identity and Access Management) is a comprehensive service that enables organizations to manage and control access to their resources and systems effectively. It implements a standard OAuth2 Authorization Service and OpenID Connect Provider and it has been chosen as the AAI solution by the WLCG community for the transition from VOMS proxy-based authorization to JSON web tokens.
This contribution describes the recent updates introduced by the latest IAM releases and the current roadmap for its evolution. In the near future, a primary focus is on avoiding the storage of access tokens in the database, to enhance the performance of both token issuance and token deletion. Another important milestone is the integration of a Multi-Factor Authentication mechanism. Additionally, substantial effort will be dedicated to migrating from outdated frameworks, such as MITREid Connect and AngularJS, to more stable and robust solutions based on Spring Security and React, respectively. As a consequence, a new dashboard is also being developed, aligned with the latest advances in User Interface design.
This contribution highlights the progress made on the development roadmap described above, together with the general auditing and performance improvements already introduced in the latest releases or planned, such as the use of Open Policy Agent to re-implement the internal mechanism of the Scope Policy API.
X.509 certificates and VOMS proxies are still widely used by various scientific communities for authentication and authorization (authN/Z) in Grid Storage and Computing Elements. Although this has contributed to improving scientific collaboration worldwide, X.509 authN/Z comes with interoperability issues with modern Cloud-based tools and services.
The Grid computing communities have decided to migrate to token-based authentication, a web technology that has proved to be flexible and secure.
The model recently adopted by the communities is based on industry standards such as OAuth2 and OpenID Connect and exploits JSON Web Tokens (JWT): a compact way to securely transmit information as JSON objects.
JWTs are usually short-lived and provide fine-grained authorization, based on "scopes", to perform specific actions.
These scopes are embedded in the token and are specified during the request procedure, so they last only until the token expiration time. Scopes can be requested based on user groups and permissions, thus making it possible to restrict a group to only a subset of actions.
These characteristics make JWTs a more secure alternative to X.509 proxies.
Being widely used in industry, JWTs are also easily integrated into services not specifically developed for the scientific community, such as calendars, sync-and-share services, collaborative software development platforms, and more.
As such, JWTs suit the many heterogeneous demands of Grid communities, and some of them already started the transition in 2022.
In the Italian WLCG Tier-1, located in Bologna and managed by INFN - CNAF, several computing resources are hosted and made available to scientific collaborations in the fields of High-Energy Physics, Astroparticle Physics, Gravitational Waves, Nuclear Physics and many others.
Although LHC experiments at CERN are the main users of CNAF resources, many other communities and experiments are being supported in their computing activities.
While the main LHC experiments have already planned their own transition from X.509 to token-based authN/Z, many medium/small-sized collaborations struggle to put effort into it.
The Tier-1 User Support unit has the duty of guiding users towards efficient and modern computing techniques and workflows involving data and computing resources access.
As such, the User Support group is playing a central role in preparing documentation, tools and services to ease the transition from X.509 to JWTs.
The foreseen support strategy and the related tools will be presented, together with future workflow plans in view of the complete transition.
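As a minimal illustration of the scope-carrying tokens discussed above, the payload of a JWT can be decoded directly; this is a sketch for inspection only (real services must verify the signature against the issuer's published keys), and the example claim values are assumptions in the style of the WLCG profile.

```python
# Minimal sketch: inspecting the (unverified) claims of a JWT, whose structure is
# header.payload.signature with base64url-encoded JSON segments.
import base64
import json

def decode_claims(token: str) -> dict:
    """Return the unverified payload of a JWT for inspection purposes."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)      # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# claims = decode_claims(access_token)
# claims["scope"] -> e.g. "storage.read:/ compute.create"  (capabilities, not identity)
# claims["exp"]   -> expiry timestamp: the granted scopes are only valid until then
```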
Simulation and analysis tools
As we are approaching the high-luminosity era of the LHC, the computational requirements of the ATLAS experiment are expected to increase significantly in the coming years. In particular, the simulation of MC events is immensely computationally demanding, and their limited availability is one of the major sources of systematic uncertainties in many physics analyses. The main bottleneck in the detector simulation is the detailed simulation of electromagnetic and hadronic showers in the ATLAS calorimeter system using Geant4.
In order to increase the MC statistics and to leverage the available CPU resources for LHC Run 3, the ATLAS collaboration has recently put into production a refined and significantly improved version of its state-of-the-art fast simulation tool AtlFast3. AtlFast3 uses classical parametric and machine learning based approaches such as Generative Adversarial Networks (GANs) for the fast simulation of LHC events in the ATLAS detector.
This talk will present the newly improved version of AtlFast3 that is currently in production for the simulation of Run 3 samples. In addition, ideas and plans for the future of fast simulation in ATLAS will also be discussed.
Detector simulation is a key component of physics analysis and related activities in CMS. In the upcoming High Luminosity LHC era, simulation will be required to use a smaller fraction of computing in order to satisfy resource constraints. At the same time, CMS will be upgraded with the new High Granularity Calorimeter (HGCal), which requires significantly more resources to simulate than the existing CMS calorimeters. This computing challenge motivates the use of generative machine learning models as surrogates to replace full physics-based simulation. We study the application of state-of-the-art diffusion models to simulate particle showers in the CMS HGCal. We will discuss methods to overcome the challenges posed by the high-dimensional, irregular geometry of the HGCal. The quality of the showers produced by the diffusion model will be assessed by comparison to the full GEANT4-based simulation. The increase in simulation throughput will be quantified and methods to accelerate the diffusion model inference will also be discussed.
In high energy physics, fast simulation techniques based on machine learning could play a crucial role in generating sufficiently large simulated samples. Transitioning from a prototype to a fully deployed model usable in full-scale production is a very challenging task.
In this talk, we introduce the most recent advances in the implementation of fast simulation for calorimeter showers in the LHCb simulation framework based on Generative AI. We use a novel component in Gaussino to streamline the incorporation of generic machine learning models. It leverages the fast simulation hooks of Geant4 and machine learning backends such as PyTorch and ONNXRuntime.
Using this infrastructure, the first implementation of selected ML models is trained and validated on the LHCb calorimeters. We will show a Variational Autoencoder (VAE) equipped with a custom sampling mechanism, as well as a transformer-based diffusion model (DiT). Both are compatible with the setup used in the CaloChallenge initiative, a collaborative effort aimed at training generic models for calorimeter shower simulation. We will share insights gained from the validation of these models on dedicated physics samples, including how to handle and version multiple ML models in production in a distributed environment.
The event simulation is a key element for data analysis at present and future particle accelerators. We show [1] that novel machine learning algorithms, specifically Normalizing Flows and Flow Matching, can be effectively used to perform accurate simulations with several orders of magnitude of speed-up compared to traditional approaches when only analysis level information is needed. In such a case it is indeed feasible to skip the whole simulation chain and directly simulate analysis observables from generator information (end-to-end simulation). We simulate jets features to compare discrete and continuous Normalizing Flows models. The models are validated across a variety of metrics to select the best ones. We discuss the scaling of performance with the increase in training data, as well as the generalization power of these models on physical processes different from the training one. We investigate sampling multiple times from the same inputs, a procedure we call oversampling, and we show that it can effectively reduce the statistical uncertainties of a sample. This class of ML algorithms is found to be highly expressive and useful for the task of simulation. Their speed and accuracy, coupled with the stability of the training procedure, make them a compelling tool for the needs of current and future experiments.
[1] arXiv:2402.13684
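The oversampling idea can be demonstrated with a toy: draw the fast-simulation response several times per generator-level event, weight each draw by 1/N to keep the yield fixed, and observe that the fluctuation added by the stochastic simulation step shrinks. The Gaussian smearing below is a stand-in for the generative model and an assumption for illustration, not the authors' setup.

```python
# Minimal sketch (assumed): oversampling a toy fast-simulation response reduces the
# statistical spread coming from the simulation step (roughly ~ 1/sqrt(N)).
import numpy as np

rng = np.random.default_rng(1)
gen_pt = rng.exponential(50.0, size=10_000)          # toy generator-level jet pT

def fast_sim(pt, n_samples):
    """Toy surrogate: sample the simulated response n_samples times per gen event."""
    return pt[:, None] * rng.normal(1.0, 0.1, size=(len(pt), n_samples))

for n in (1, 4, 16):
    # each oversampled copy effectively carries weight 1/n, so the total yield is unchanged
    estimates = [fast_sim(gen_pt, n).mean() for _ in range(50)]
    print(f"N={n:2d}: spread of the mean reco pT across repeats = {np.std(estimates):.4f}")
```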
Fast simulation of the energy depositions in high-granular detectors is needed for future collider experiments with ever increasing luminosities. Generative machine learning (ML) models have been shown to speed up and augment the traditional simulation chain. Many previous efforts were limited to models relying on fixed regular grid-like geometries leading to artifacts when applied to highly granular calorimeters with realistic cell layouts. We present CaloClouds III, a novel point cloud diffusion model that allows for high-speed generation of realistic electromagnetic showers due to the distillation into a consistency model. The model is conditioned on incident energy and impact angles and implemented into a realistic DD4hep based simulation model of the ILD detector concept for a future Higgs factory. This is done with the DDFastShowerML library which has been developed to allow for easy integration of generative fast simulation models into any DD4hep based detector model. With this it is possible to benchmark the performance of a generative ML model using fully reconstructed physics events by comparing them against the same events simulated with Geant4, thereby ultimately judging the fitness of the model for application in an experiment’s Monte Carlo.
Collaborative software and maintainability
The Key4hep software stack enables studies for future collider projects. It provides a full software suite for doing event generation, detector simulation as well as reconstruction and analysis. In the Key4hep stack, over 500 packages are built using the spack package manager and deployed via the cvmfs software distribution system. In this contribution, we explain the current setup for building nightly builds and stable releases that are made every few months or as needed. These builds are made available to users, who have access to a full and consistent software stack via a simple setup script. Different operating systems and compilers are supported and some utilities are provided to make development on top of the Key4hep builds easier. Both the benefits of the community-driven approach followed in spack and the issues found along the way are discussed.
The Spack package manager has been widely adopted in the supercomputing community as a means of providing consistently built on-demand software for the platform of interest. Members of the high-energy and nuclear physics (HENP) community, in turn, have recognized Spack’s strengths, used it for their own projects, and even become active Spack developers to better support HENP needs. Code development in a Spack context, however, can be challenging as the provision of external software via Spack must integrate with the developed packages’ build systems. Spack’s own development features can be used for this task, but they tend to be inefficient and cumbersome.
We present a solution pursued at Fermilab called MPD (multi-package development). MPD aims to facilitate the development of multiple Spack-based packages in concert without the overhead of Spack’s own development facilities. In addition, MPD allows physicists to create multiple development projects with an interface that insulates users from the many commands required to use Spack well.
The ePIC collaboration is working towards realizing the primary detector for the upcoming Electron-Ion Collider (EIC). As ePIC approaches critical decision milestones and moves towards future operation, software plays a critical role in systematically evaluating detector performance and laying the groundwork for achieving the scientific goals of the EIC project. The scope and schedule of the project require a balanced approach between near-term priorities, such as preparing the Technical Design Report, and long-term objectives for the future construction, commissioning, and operational phases. ePIC leverages an agile development process with high-level milestones to ensure continuous real-world testing of the software through monthly production campaigns and CI-driven benchmarks. The ePIC software stack embraces cutting-edge, sustainable community software tools and avoids the "not invented here" syndrome by building on top of well-supported and actively developed frameworks like the Key4hep stack (DD4hep, PODIO, EDM4hep) and ACTS. This collaborative development approach fosters an elevated standard of quality based on lessons learned by the nuclear physics and high energy physics communities. This talk will explore our setup for a collaborative development process and how it integrates with our vision for Software & Computing in the future ePIC experiment.
The ePIC collaboration is realizing the first experiment of the future Electron-Ion Collider (EIC) at the Brookhaven National Laboratory that will allow for a precision study of the nucleon and the nucleus at the scale of sea quarks and gluons through the study of electron-proton/ion collisions. This talk will discuss the current workflow in place for running centralized simulation campaigns for ePIC on the Open Science Grid infrastructure. This involves monthly releases of ePIC software and container deployments to CVMFS, generation of input datasets in HepMC format according to collaboration-defined policy, using Snakemake in CI/CD for validation and benchmarking, and submitting jobs to the Open Science Grid condor scheduler for opportunistic running on available resources. File transfers utilize XrootD, and RUCIO is used for data management. The workflow is being continuously refined to improve daily throughput (currently ~50-100k core hours per day) and minimize job failures. Since May 2023, monthly simulation campaigns employing the workflow have cumulatively used over ~10 million core hours on the Open Science Grid and produced over ~280 TB of simulation data. The campaigns incorporate simulations for the broad science program of the EIC and are actively used for the detector and physics studies in preparation of the Technical Design Review.
CERN provides a thriving environment for groundbreaking physics research and for pushing the barriers of technology, and its members participate in many talks and conferences every year. However, given that the ATLAS experiment has around 6000 members and more than one may be qualified to present the same talk, the experiment has developed metrics to prioritize them.
Currently, ATLAS is organized in a tree structure with 260 groups and subgroups, called activities. Each of these activities has responsible members such as the conveners or sub-conveners, project leaders, and activity coordinators. Because of this tree structure, a member's nomination works its way up the branches, providing the upper levels with input from the lower ones. Previously, this process was not automated and happened through the exchange of CSV files, which did not provide these conveners and coordinators with the big picture of the nominations' priorities and reasons.
To improve this process, two systems were developed by the ATLAS Glance team: Activities and SCAB Nominations. The Activities interface provides a user-friendly view to manage the activities tree structure, the coordinators of each activity, and their allowed actions in the nomination process. The SCAB Nominations interface automates the nomination process of the ATLAS Speakers Committee Advisory Board, allowing all the coordinators to assign priorities to their nominees and justify them in comments. These two systems contribute to a more holistic process for selecting collaboration members to present at a specific conference. This presentation delves into their specifications.
CERN has a very dynamic environment and faces challenges such as information centralization, communication between the experiments’ working groups, and the continuity of workflows. The solution found for those challenges is automation and, therefore, the Glance project, an essential management software tool for all four large LHC experiments. Its main purpose is to develop and maintain web-based automated solutions that are easy to learn and use and allow collaboration members to perform their tasks quickly.
The ATLAS Management Glance team is a subset of the Glance team focused on attending to the software requests of the ATLAS Spokesperson and deputies. The team maintains 11 systems that allow the management of ATLAS members, appointments, analyses, speaker nomination, and selection, among other tasks. Historically, each Glance developer would be an expert in the requirements of one or more systems, but their product management was inefficient, lacking the mapping of the product vision, goals, business rules, personas, and metrics. Also, the team's roadmap lacked predictability since it had no planned timeline.
In September 2023, the ATLAS Management Glance team adopted the Product Owner role, concentrated in a single person as recommended by the Scrum Guide. This presentation dives into the challenges faced by the Glance Team Product Owner in establishing a strategy for effective product management and roadmap planning, and the key takeaways from that process.
Computing Infrastructure
This presentation delves into the implementation and optimization of checkpoint-restart mechanisms in High-Performance Computing (HPC) environments, with a particular focus on Distributed MultiThreaded CheckPointing (DMTCP). We explore the use of DMTCP both within and outside of containerized environments, emphasizing its application on NERSC Perlmutter, a cutting-edge supercomputing system. The discussion highlights the benefits of checkpoint-restart (C/R) techniques in managing complex, long-duration computations, showcasing the efficiency and reliability of these methods. These techniques have been thoroughly tested and validated with Geant4, a crucial tool for High Energy and Nuclear Physics. We further examine the integration of HPC containers, such as Shifter and Podman-HPC, which enhance computational task management and ensure consistent performance across various environments. Through real-world application examples, we illustrate the advantages of DMTCP in multi-threaded and distributed computing scenarios. Additionally, we present the methods and results, demonstrating the impact of C/R on resource utilization, the future directions of this research, and its potential across various scientific domains.
The German university-based Tier-2 centres successfully contributed a significant fraction of the computing power required for Runs 1-3 of the LHC. But for the upcoming Run 4, with its increased need for both storage and computing power for the various HEP computing tasks, a transition to a new model becomes a necessity. In this context, the German community under the FIDIUM project is making interdisciplinary resources of the National High Performance Computing (NHR) usable within the WLCG and centralising mass storage at the Helmholtz centres.
The Goettingen campus hosts both a WLCG Tier-2 site, GoeGrid, and the HPC cluster Emmy that is part of the National High-Performance Computing (NHR) center NHR-Nord@Göttingen. The integration is done by virtually extending the GoeGrid batch system with containers, turning the HPC nodes into virtual worker nodes with their own partitionable job scheduling in order to run GoeGrid HEP jobs for the ATLAS collaboration. Submission and management of these containers are automated using COBalD (the Opportunistic Balancing Daemon) and TARDIS (The Transparent Adaptive Resource Dynamic Integration System). Data are provided via the GoeGrid mass storage for which a dedicated network connection has been established. Continuous production of ATLAS jobs is currently being tested in a one-year pilot phase. The setup, experience, performance tests and outlook are presented.
In a geo-distributed computing infrastructure with heterogeneous resources (HPC, HTC and possibly cloud), a key to unlocking efficient and user-friendly access to the resources is the ability to offload each specific task to the best-suited location. One of the most critical problems is the logistics of wide-area, multi-stage workflows moving back and forth between multiple resource providers.
We envision a model in which this challenge is addressed by enabling a "transparent offloading" of containerized payloads through the Kubernetes API primitives, creating a common cloud-native interface to access any number of external machines and types of backends. To this end, we created the interLink project, an open-source extension of the Virtual Kubelet concept, designed to provide a common abstraction over heterogeneous and distributed backends.
interLink is developed by INFN in the context of interTwin, an EU funded project that aims to build a digital-twin platform (Digital Twin Engine) for sciences, and the ICSC National Research Centre for High Performance Computing, Big Data and Quantum Computing in Italy. In this talk we first provide a comprehensive overview of the key features and the technical implementation. We showcase our major case studies such as the scale out of an analysis facility, and the distribution of ML training processes. We focus on the impacts of being able to seamlessly exploit world-class EuroHPC supercomputers with such a technology.
The MareNostrum 5 (MN5) is the new 750k-core general-purpose cluster recently deployed at the Barcelona Supercomputing Center (BSC). MN5 presents new opportunities for the execution of CMS data processing and simulation tasks but suffers from the same stringent network connectivity limitations as its predecessor, MN4. The innovative solutions implemented to navigate these constraints and effectively leverage the resources within the CMS distributed computing environment need to be revisited. First, the new worker nodes have increased their processor core count, and are thus capable of handling larger multicore CPU-bound CMS simulation tasks. Furthermore, the provisioning of larger disk storage capacity for MN5 broadens the spectrum of CMS workload types that can be accommodated at BSC. This storage space could, for example, be used to temporarily host large datasets required as input for CMS tasks, such as the pile-up samples, usually accessed by proton collision simulation jobs at runtime from remote grid sites’ storages. These tasks were previously unsuitable for execution, given the connectivity limitations from BSC to remote storages. Enhanced network bandwidth between MN5 and the Port d’Informació Cientifica (PIC) can also facilitate the expansion of BSC capabilities by provisioning input for CMS data processing tasks at BSC, thus expanding the role of this resource in the CMS computing landscape. This contribution will provide an overview of the commissioning efforts and the results of the subsequent exploitation of MN5 for CMS, showcasing the new transformative capacities introduced by the MN5 cluster.
The CMS experiment's operational infrastructure hinges significantly on the CMSWEB cluster, which serves as the cornerstone for hosting a multitude of services critical to the data taking and analysis. Operating on Kubernetes ("k8s") technology, this cluster powers over two dozen distinct web services, including but not limited to DBS, DAS, CRAB, WMarchive, and WMCore.
In this talk, we propose and develop an application specifically tailored to anomaly detection within this ecosystem of services. The core approach involves harnessing machine and deep learning methods, alongside a comprehensive exploration of various service parameters, to identify irregularities and potential threats effectively. The application is designed to continually monitor these services for any deviation from their expected behavior. Leveraging diverse machine and deep learning techniques and scrutinizing service-specific parameters, the application will be equipped to discern anomalies and aberrations that might signify security breaches or performance issues. Once an anomaly is detected, the system will not only record the event but also promptly generate alerts. These alerts will be intelligently routed to the relevant service developers or administrators responsible for maintaining the affected components. This proactive alerting mechanism ensures that any emerging issues are swiftly addressed, minimizing potential disruptions and fortifying the overall reliability of the CMSWEB cluster and its critical services.
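The abstract does not specify a particular model, so the sketch below uses an Isolation Forest as one plausible choice: train on historical service metrics and alert on samples the model flags as outliers. The metric names and thresholds are hypothetical.

```python
# Minimal sketch (assumed, not the CMSWEB application): anomaly detection on
# per-service metrics with an Isolation Forest; flagged samples would trigger alerts.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# hypothetical per-minute metrics for one service: [request rate, error rate, p95 latency ms]
history = np.column_stack([rng.normal(200, 20, 5000),
                           rng.normal(0.01, 0.005, 5000),
                           rng.normal(120, 15, 5000)])

detector = IsolationForest(contamination=0.01, random_state=0).fit(history)

latest = np.array([[210, 0.012, 118],     # looks normal
                   [205, 0.20, 900]])     # error spike plus latency blow-up
for metrics, flag in zip(latest, detector.predict(latest)):
    if flag == -1:                        # -1 means "anomalous" in scikit-learn
        print("ALERT: anomalous service metrics", metrics)
```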
The efficient utilization of multi-purpose HPC resources for High Energy Physics applications is increasingly important, in particular with regard to the upcoming changes in the German HEP computing infrastructure.
In preparation for the future, we are developing and testing an XRootD-based caching and buffering approach for workflow and efficiency optimizations to exploit the full potential of such resources despite the challenges and potential limitations associated with them.
With this contribution, we want to present a first prototype of our approach, deployed for optimizing the utilization of HoreKa, our local HPC cluster at KIT, that is opportunistically integrated into GridKa, the German Tier-1 center.
This includes first experiences and additional benefits for the operation of such sites that come with the additional monitoring capabilities of our setup.
The Dirac interware has long served as a vital resource for user communities seeking access to distributed computing resources. Originating within the LHCb collaboration around 2000, Dirac has undergone significant evolution. A pivotal moment occurred in 2008 with a major refactoring, resulting in the development of the experiment-agnostic core Dirac, which paved the way for customizable extensions like LHCbDirac and BelleDirac, among others.
Despite its efficacy in meeting experiment-specific requirements, Dirac has accrued technical debt over its 15-year history. Installation management remains intricate, with significant entry barriers and a reliance on bespoke infrastructure. Additionally, the software development process lacks alignment with contemporary standards, impeding the onboarding process for new developers. Notably, integral components such as the network protocol and authentication mechanisms are proprietary and pose challenges for seamless integration with external applications.
In response to these challenges, the Dirac consortium has embarked on the development of DiracX. Drawing upon two decades of experience and battle-tested technological frameworks, DiracX heralds a new era in distributed computing solutions. This contribution describes technical decisions, roadmap and timelines for the development of DiracX.
This article presents an overview of the architecture underpinning DiracX, shedding light on the technological decisions guiding its development. Recognizing the criticality of maintaining a continuously operational Dirac system for numerous user communities, we delve into the intricacies of the migration process from Dirac to DiracX.
The ATLAS Google Project was established as part of an ongoing evaluation of the use of commercial clouds by the ATLAS Collaboration, in anticipation of the potential future adoption of such resources by WLCG grid sites to fulfil or complement their computing pledges. Seamless integration of Google cloud resources into the worldwide ATLAS distributed computing infrastructure was achieved at large scale and for an extended period of time, and hence cloud resources are shown to be an effective mechanism to provide additional, flexible computing capacity to ATLAS. For the first time a Total Cost of Ownership analysis has been performed, to identify the dominant cost drivers and explore effective mechanisms for cost control. Network usage significantly impacts the costs of certain ATLAS workflows, underscoring the importance of implementing such mechanisms. Resource bursting has been successfully demonstrated, whilst exposing the true cost of this type of activity. A follow-up to the project is underway to investigate methods for improving the integration of cloud resources in data-intensive distributed computing environments and reducing costs related to network connectivity, which represents the primary expense when extensively utilising cloud resources.
The metadata schema for experimental nuclear physics project aims to facilitate data management and data publication under the FAIR principles in the experimental nuclear physics communities by developing a cross-domain metadata schema and generator tailored for diverse datasets, with the possibility of integration with other, similar fields of research (e.g. astro- and particle physics).
Our project focuses on creating a standardized, adaptable framework that enhances data Findability, Accessibility, Interoperability, and Reusability (FAIR principles). By creating a comprehensive and adaptable metadata schema, the project ensures scalable integration of both machine and human-readable metadata, thereby improving the efficiency of data discovery and utilization.
A pivotal component of the project is its nodal, multi-layered schema structure, allowing metadata enrichment from multiple domains while maintaining essential overlaps for enhanced versatility. This comprehensive approach supports the unification of data standards across various research institutions, promoting interoperability and collaboration on a European scale. Our efforts also extend to the development of a user-friendly frontend generator, designed not only to facilitate metadata input but also to allow users to specify field-specific attributes, customize generic names to suit their needs, and export schemas in various formats such as JSON and XML, adhering to different nomenclatures.
The project involves world-class RIs and ESFRIs, and leverages synergies from existing Open Science initiatives like EOSC, ESCAPE, EURO-LABS, and PUNCH4NFDI. In this contribution, we will present an overview of the project, detailing the development steps, key features of the metadata schema, and the functionality of the frontend generator.
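To make the layered schema idea concrete, the sketch below shows a record combining a generic, cross-domain core block with a domain-specific extension, exported as JSON. All field names and values are hypothetical and do not reflect the project's actual schema.

```python
# Minimal sketch (hypothetical fields): a layered metadata record with a generic core
# block and a domain-specific block, exported as machine-readable JSON.
import json

record = {
    "core": {                                  # cross-domain layer (FAIR-oriented)
        "title": "Toy nuclear physics dataset",
        "identifier": "doi:10.xxxx/example",   # placeholder identifier
        "creators": ["Example Collaboration"],
        "license": "CC-BY-4.0",
    },
    "domain": {                                # experiment/domain-specific layer
        "facility": "Example RI",
        "reaction": "p + 208Pb",
        "beam_energy_MeV": 600,
    },
}

print(json.dumps(record, indent=2))            # an XML export would follow the same tree
```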
For several years, the ROOT team has been developing the new RNTuple I/O subsystem in preparation for the next generation of collider experiments. Both HL-LHC and DUNE are expected to start data taking by the end of this decade. They pose unprecedented challenges to event data I/O in terms of data rates, event sizes and event complexity. At the same time, the I/O landscape is becoming more diverse. HPC cluster file systems and object stores, NVMe disk cache layers in analysis facilities, and S3 storage on cloud resources are mixing with traditional XRootD-managed spinning-disk pools.
The ROOT team will finalize a first production version of the RNTuple binary format by the end of the year. After this point, ROOT will provide backwards compatibility for RNTuple data. This contribution provides an overview of the RNTuple feature set, the related R&D activities, and the long-term vision for RNTuple. We report on performance, interface design, tooling, robustness, integration with experiment frameworks, and validation results as well as recent R&D on parallel reading and writing and exploitation of modern hardware and storage systems. We will give an outlook on possible future features after a first production release.
The IT and EP departments have jointly launched a formal project within the Research and Computing sector to evaluate the novel data format for physics analysis data used in the LHC experiments and other fields. This aspect of the project focuses on verifying the scalability of the EOS storage back-end during the migration from TTree to RNTuple, using both replicated and erasure-coded profiles.
During Run 3, the Large Hadron Collider (LHC) experiments are transferring up to 10 PB of data daily across the Worldwide LHC Computing Grid (WLCG) sites. However, in the transition from Run 3 to Run 4, data volumes are expected to increase tenfold. The WLCG Data Challenge aims to address this significant scaling challenge through a series of rigorous test events.
The primary objective of the 2024 Data Challenge (DC24) was to achieve 25% of the anticipated bulk transfer rate required for Run 4. Six experiments participated: the four LHC experiments (ATLAS, CMS, LHCb and ALICE) as well as Belle II and DUNE. These experiments use the same networks, many of the same sites, and the same data management tools that will be employed in Run 4. Additionally, DC24 aimed to test new technologies such as token-based authorization and advanced network monitoring tools.
The direct benefits of DC24 included identifying bottlenecks within the centralized data management systems of each experiment, gaining experience with significantly higher data transfer rates, and fostering significant collaboration among experiments and stakeholders. These stakeholders encompassed site administrators, storage technology providers, network experts, and middleware tool developers, all contributing to the preparedness for the demands of Run-4.
Back in the late 1990s, when planning for LHC computing started in earnest, arranging network connections to transfer the huge LHC data volumes between participating sites was seen as a problem. Today, 30 years later, the LHC data volumes are even larger, WLCG traffic has switched from a hierarchical to a mesh model, and yet almost nobody worries about the network.
Some people still do worry, however. Even if LHC data transfers still account for over 50% of NREN traffic, other data-intensive experiments are coming on stream, and network engineers worry about managing the overall traffic efficiently.
We present here the challenges likely to keep network engineers busy in the coming decade: how to monitor traffic from different communities; how to avoid congestion over transoceanic links; how to smooth traffic flows to maximise throughput; how to hand over large flows at interconnection points; cyber security; and more.
Data and Metadata Organization, Management and Access
The CMS experiment manages a large-scale data infrastructure, currently handling over 200 PB of disk and 500 PB of tape storage and transferring more than 1 PB of data per day on average between various WLCG sites. Utilizing Rucio for high-level data management, FTS for data transfers, and a variety of storage and network technologies at the sites, CMS confronts inevitable challenges due to the system’s growing scale and evolving nature. Key challenges include managing transfer and storage failures, optimizing data distribution across different storages based on production and analysis needs, implementing necessary technology upgrades and migrations, and efficiently handling user requests. The data management team has established comprehensive monitoring to supervise this system and has successfully addressed many of these challenges. The team’s efforts aim to ensure data availability and protection, minimize failures and manual interventions, maximize transfer throughput and resource utilization, and provide reliable user support. This paper details the operational experience of CMS with its data management system in recent years, focusing on the encountered challenges, the effective strategies employed to overcome them and the ongoing challenges as we prepare for future demands.
The Deep Underground Neutrino Experiment (DUNE) is scheduled to start running in 2029, expected to record 30 PB/year of raw data. To handle this large-scale data, DUNE has adopted and deployed Rucio, the next-generation Data Replica service originally designed by the ATLAS collaboration, as an essential component of its Distributed Data Management system.
DUNE's use of Rucio has demanded the addition of various features to the Rucio code base, both specific functionality for DUNE alone, and more general functionality that is crucial for DUNE whilst being potentially useful for other experiments. As part of our development work, we have introduced a "policy package" system allowing experiment-specific code to be maintained separately from the core Rucio code, as well as creating a DUNE policy package containing algorithms such as logical to physical filename translation, and special permission checks. We have also developed other features such as improved object store support, and customisable replica sorting. A DUNE-specific test suite that will run on GitHub Actions is currently under development.
Recently, DUNE has deployed new internal monitoring for Rucio, enabling us to extract more useful information from the core Rucio servers and from daemons such as the transmogrifier, reaper, etc. Additionally, DUNE has implemented monitoring of Rucio transfer and deletion activities, which are sent to a message queue via the Rucio Hermes daemon. Information such as data location, accounting, and storage summaries is extracted from the Rucio internal database and dumped into Elasticsearch for visualisation. The visualisation platforms used are based at Fermilab and Edinburgh. This monitoring is crucial for the ongoing DUNE data transfer and management development.
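The logical-to-physical filename translation mentioned above can be illustrated with a deterministic, hash-based scheme in the style of Rucio's default algorithm; the details of DUNE's policy package may differ, and the scope and filename below are hypothetical.

```python
# Minimal sketch: deterministic logical-to-physical path translation in the style of a
# hash-based Rucio algorithm. The path depends only on scope and name, so no catalogue
# lookup is needed to locate a replica on a storage element.
import hashlib

def lfn_to_path(scope: str, name: str) -> str:
    digest = hashlib.md5(f"{scope}:{name}".encode()).hexdigest()
    return f"{scope}/{digest[0:2]}/{digest[2:4]}/{name}"       # two hash-derived directory levels

# Hypothetical example (scope and filename are placeholders)
print(lfn_to_path("protodune-sp", "run005141_0001_reco.root"))
```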
The File Transfer Service (FTS) is a bulk data mover responsible for queuing, scheduling, dispatching and retrying file transfer requests, making it a critical infrastructure component for many experiments. FTS is primarily used by the LHC experiments, namely ATLAS, CMS and LHCb, but is also used by some non-LHC experiments, including AMS and DUNE. FTS is an essential part of the data movement pipeline for these experiments and is responsible for moving their data across the world via the Worldwide LHC Computing Grid (WLCG).
The Square Kilometre Array (SKA) is a multi-purpose radio telescope that will play a major role in answering key questions in modern astrophysics and cosmology. The SKA will have a survey speed a hundred times that of current radio telescopes and its capabilities will allow transformational experiments to be conducted in a wide variety of science areas. Whilst the headquarters for this project is located at Jodrell Bank in the UK, the main telescope sites are located in South Africa and Australia. The two telescope sites will produce approximately 700 PB of data per year, which will need to be moved to one of the SKA regional centres located in member countries around the world to be stored, before being accessed by scientists. It is evident that there will be several similarities between the computing requirements for the LHC and SKA experiments, in particular the challenges posed by moving large quantities of data around a global network.
In this talk, we will discuss the usage of FTS by SKA and its ability to enable long-range data transfers across the developing SKA regional centre network of sites. We will also discuss some alterations to the FTS service run at STFC to better support SKA, most notably the migration from X.509 certificates to token-based authentication.
Modern physics experiments are often led by large collaborations including scientists and institutions from different parts of the world. To cope with the ever increasing computing and storage demands, computing resources are nowadays offered as part of a distributed infrastructure. Einstein Telescope (ET) is a future third-generation interferometer for gravitational wave (GW) detection, and is currently in the process of defining a computing model to sustain ET physics goals. A critical challenge for present and future experiments is an efficient and reliable data distribution and access system. Rucio is a framework for data management, access and distribution. It was originally developed by the ATLAS experiment and has been adopted by several collaborations within the high energy physics domain (CMS, Belle II, Dune) and outside (ESCAPE, SKA, CTA). In the GW community Rucio is used by the second-generation interferometers LIGO and Virgo, and is currently being evaluated for ET. ET will observe a volume of the Universe about one thousand times larger than LIGO and Virgo, and this will reflect on a larger data acquisition rate. In this contribution, we briefly describe Rucio usage in current GW experiments, and outline the on-going R&D activities for integration of Rucio within the ET computing infrastructure, which include the setup of an ET Data Lake based on Rucio for future Mock Data Challenges. We discuss the customization of Rucio features for the GW community: in particular we describe the implementation of RucioFS, a POSIX-like filesystem view to provide the user with a more familiar structure of the Rucio data catalogue, and the integration of the ET Data Lake with mock Data Lakes belonging to other experiments within the astrophysics and GW communities. This is a critical feature for astronomers and GW data analysts since they often require access to open data from other experiments for sky localisation and multi-messenger analysis.
The set of sky images recorded nightly by the camera mounted on the telescope of the Vera C. Rubin Observatory will be processed in facilities located on three continents. Data acquisition will happen in Cerro Pachón in the Andes mountains in Chile where the observatory is located. A first copy of the raw image data set is stored at the summit site of the observatory and immediately transferred through dedicated network links to the archive site and US Data Facility hosted at SLAC National Laboratory in California, USA. After an embargo period of a few days, the full image set is copied to the UK and French Data Facilities where a third copy is located.
During its 10 years of operation, starting in late 2025, annual processing campaigns over all images taken to date will be jointly performed by the three facilities, involving sophisticated algorithms to extract the physical properties of the celestial objects and producing science-ready images and catalogs. Data products resulting from the processing campaigns at each facility will be sent to SLAC and combined to create a consistent Data Release, which is served to the scientific community for its science studies via Data Access Centers in the US and Chile and Independent Data Access Centers elsewhere.
In this contribution we present an overall view of how we leverage the tools selected for managing the movement of data among the Rubin processing and serving facilities, including Rucio and FTS3. We will also present the tools we developed to integrate Rucio’s data model and Rubin’s Data Butler, the software abstraction layer that mediates all access to storage by the pipeline tasks which implement the science algorithms.
The Belle II raw data transfer system is responsible for transferring raw data from the Belle II detector to the local KEK computing centre, and from there to the Grid. The Belle II experiment recently completed its first Long Shutdown period, during which many upgrades were made to the detector and to the tools used to handle and analyse the data. The Belle II data acquisition (DAQ) systems received significant improvements, necessitating changes in the processing steps for raw data. Furthermore, experience gained during Run 1 identified areas where the scalability of the system could be improved to better handle the expected increase in data rates in future years.
To address these issues, extensive upgrades were made to the raw data transfer system, including: utilisation of the DIRAC framework for all data transfers; a change in the protocol used to communicate with the DAQ systems; and retirement of the previously used file format conversion component of the system. This talk will describe these changes and improvements in detail, and give an overview of the current state of the Belle II raw data transfer system.
Online and real-time computing
The Mu3e experiment at the Paul-Scherrer-Institute will be searching for the charged lepton flavor violating decay $\mu^+ \rightarrow e^+e^-e^+$. To reach its ultimate sensitivity to branching ratios in the order of $10^{-16}$, an excellent momentum resolution for the reconstructed electrons is required, which in turn necessitates precise detector alignment. To compensate for weak modes in the main alignment strategy based on electrons and positrons from muon decays, the exploitation of cosmic ray muons is proposed.
The trajectories of cosmic ray muons are so different from the decays of stopped muons in the experiment that they cannot be reconstructed using the same method in the online filter farm. For this reason and in view of their comparatively rare occurrence, a special cosmic muon trigger is being developed. A study on the application of graph neural networks to classify events and to identify cosmic muon tracks will be presented.
The increasing complexity and data volume of Nuclear Physics experiments require significant computing resources to process data from experimental setups. The entire experimental data set has to be processed to extract sub-samples for physics analysis. Advancements in Artificial Intelligence and Machine Learning provide tools and procedures that can significantly enhance the throughput of data processing and reduce the computational resources needed to process and categorize the experimental data in the raw data stream. In CLAS12, machine learning methods have been developed to perform track reconstruction in real time, allowing the identification of physics reactions from the raw data stream at rates exceeding the data acquisition rate. In this paper, we present Neural Network-driven track reconstruction that allows event classification and physics analysis in real time. We present a complete physics analysis of the data processed online.
The reconstruction of charged particle trajectories in tracking detectors is crucial for analyzing experimental data in high-energy and nuclear physics. Processing the vast amount of data generated by modern experiments requires computationally efficient solutions to save time and resources. In response, we introduce TrackNET, a recurrent neural network specifically designed for track recognition in pixel and strip-based particle detectors. TrackNET acts as a scalable alternative to the Kalman filter, exemplifying local tracking methods by independently processing each track candidate. We rigorously tested TrackNET using the TrackML dataset and simulated data from the straw tracker of the SPD experiment at JINR, Dubna. Our results demonstrate significant improvements in processing speed and accuracy. The paper concludes with a comprehensive analysis of TrackNET's performance and a discussion of its limitations and potential enhancements.
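As a minimal sketch of the general idea behind such local, recurrent track following (not the authors' exact TrackNET architecture), a small GRU can consume the hits collected so far for a track candidate and predict a search window on the next detector station:

```python
import torch
import torch.nn as nn

class NextHitPredictor(nn.Module):
    """Toy recurrent track follower: given a sequence of hit coordinates,
    predict the centre and half-widths of a search window on the next station."""
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=3, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 4)  # (x, y) window centre + (dx, dy) window size

    def forward(self, hits):              # hits: (batch, n_hits, 3) = (x, y, z)
        out, _ = self.rnn(hits)
        pred = self.head(out[:, -1])      # use the last hidden state
        centre = pred[:, :2]
        size = torch.nn.functional.softplus(pred[:, 2:])  # keep window sizes positive
        return centre, size

model = NextHitPredictor()
fake_candidates = torch.randn(8, 5, 3)    # 8 toy candidates with 5 hits each
centre, size = model(fake_candidates)
print(centre.shape, size.shape)           # torch.Size([8, 2]) twice
```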
Tracking charged particles resulting from collisions in the presence of a strong magnetic field is an important and challenging problem. Reconstructing the tracks from the hits created by those particles on the detector layers via ionization energy deposits is traditionally achieved through Kalman filters, which scale worse than linearly as the number of hits grows. To improve efficiency, new tracking methods need to be developed. Machine Learning (ML) has been leveraged in several science applications for both speedups and improved results. Along these lines, a class of ML algorithms called Graph Neural Networks (GNNs) is explored for charged particle tracking. Each event in the particle tracking data naturally imposes itself as a graph structure, with the event hits represented as graph nodes while track segments are represented as a subset of the graph edges that need to be correctly classified by the ML algorithm. We compare three different approaches for tracking at the GlueX experiment at Jefferson Lab, namely traditional track finding, GPU-based GNN, and FPGA-based GNN. The comparison is made in terms of inference time and performance. Besides presenting the data processing, graph construction, and the GNN model used, we provide insight into resolving the missing-hits issue for GNN training and evaluation. We show that the GNN model can achieve significant speedup by processing multiple events in batches, which exploits the high parallel computation capability of GPUs. We present results on real GlueX data in addition to the collective results on simulated data.
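The sketch below illustrates, in plain PyTorch and independently of the GlueX implementation, the edge-classification formulation described above: hits are graph nodes, candidate edges connect pairs of hits, and a small network scores each edge as belonging to a track segment or not.

```python
import torch
import torch.nn as nn

class EdgeClassifier(nn.Module):
    """Score candidate edges between hit pairs as true/false track segments."""
    def __init__(self, node_dim=3, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(node_dim, hidden), nn.ReLU())
        self.edge_net = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, hits, edge_index):
        # hits: (n_hits, 3); edge_index: (2, n_edges) with sender/receiver hit indices
        h = self.encoder(hits)
        senders, receivers = edge_index
        edge_features = torch.cat([h[senders], h[receivers]], dim=1)
        return torch.sigmoid(self.edge_net(edge_features)).squeeze(-1)

hits = torch.randn(100, 3)                    # toy hit positions
edge_index = torch.randint(0, 100, (2, 400))  # toy candidate edges
scores = EdgeClassifier()(hits, edge_index)   # probability per edge
print(scores.shape)                           # torch.Size([400])
```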
The ALICE Time Projection Chamber (TPC) is the detector with the highest data rate of the ALICE experiment at CERN and is the central detector for tracking and particle identification. Efficient online computing, such as clusterization and tracking, is mainly performed on GPUs with throughputs of approximately 900 GB/s. Clusterization itself is a well-studied problem, with a variety of algorithms available in the field of machine learning. This work investigates a neural network approach to cluster rejection and regression on a topological basis. Central to its task are the estimation of the center of gravity, sigma and total charge, as well as the rejection of clusters in the TPC readout. Additionally, a momentum vector estimate is made from the 3D input across readout rows in combination with reconstructed tracks, which can benefit track seeding. Performance studies on inference speed, model architectures and physics performance on Monte Carlo data are presented, showing that tracking performance can be maintained while rejecting 5-10% of raw clusters, with an O(30%) reduced fake rate for clusterization itself compared to the current GPU clusterizer.
Polarized cryo-targets and polarized photon beams are widely used in experiments at Jefferson Lab. Traditional methods for maintaining the optimal polarization involve manual adjustments throughout data taking, an approach that is prone to inconsistency and human error. Implementing machine learning-based control systems can improve the stability of the polarization without relying on human intervention. The cryo-target polarization is influenced by temperature, microwave energy and the distribution of paramagnetic radicals, as well as operational conditions including the radiation dose. Diamond radiators are used to generate linearly polarized photons from a primary electron beam. The energy spectrum of these photons can drift over time due to changes in the primary electron beam conditions and diamond degradation. As a first step towards automating the continuous optimization and control processes, uncertainty-aware surrogate models have been developed to predict the polarization based on historical data. This talk will provide an overview of the use cases and models developed, highlighting the collaboration between data scientists and physicists at Jefferson Lab.
Offline Computing
Jet reconstruction remains a critical task in the analysis of data from HEP colliders. We describe in this paper a new, highly performant Julia package for jet reconstruction, JetReconstruction.jl, which integrates into the growing ecosystem of Julia packages for HEP. With this package users can run sequential jet reconstruction algorithms. In particular, for LHC events, the Anti-$\mathrm{k_T}$, Cambridge/Aachen and Inclusive $\mathrm{k_T}$ algorithms can be used. For FCCee studies, alternative algorithms such as the generalised ee-$\mathrm{k_T}$ and Durham algorithms are also supported.
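For reference, the pp algorithms listed above belong to the generalised $\mathrm{k_T}$ family, defined by the standard distance measures $d_{ij} = \min(k_{t,i}^{2p}, k_{t,j}^{2p})\,\Delta R_{ij}^{2}/R^{2}$ and $d_{iB} = k_{t,i}^{2p}$, with $p = 1$ for inclusive $\mathrm{k_T}$, $p = 0$ for Cambridge/Aachen and $p = -1$ for anti-$\mathrm{k_T}$; at each clustering step the smallest distance determines whether two pseudojets are merged or one is promoted to a final jet. (This is the textbook definition of the algorithm family, quoted here for orientation; it is not specific to the package.)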
The full reconstruction history is made available, allowing inclusive and exclusive jets to be retrieved. The package also provides the means to visualise the reconstruction.
The implementation of the package in Julia is discussed, with an emphasis on the features of the language that allow for an ergonomic, easy-to-work-with implementation that achieves high performance. Julia's ecosystem offers the possibility to vectorise code, using single-instruction-multiple-data processing, in a way that is transparent for the developer and more flexible than the optimisation done by C and C++ compilers. Thanks to this feature, the performance of JetReconstruction.jl is better than that of the current FastJet C++ implementation in jet clustering for p-p events produced at the LHC.
Finally, an example of an FCCee analysis using JetReconstruction.jl is shown.
Key4hep, a software framework and stack for future accelerators, integrates all the steps in the typical offline pipeline: generation, simulation, reconstruction and analysis. The different components of Key4hep use a common event data model, called EDM4hep. For reconstruction, Key4hep leverages Gaudi, a proven framework already in use by several experiments at the LHC, to orchestrate configuration and execution of reconstruction algorithms.
In this contribution, a brief overview of Gaudi is given. The specific developments built to make Gaudi work seamlessly with EDM4hep (and therefore in Key4hep) are explained, as well as other improvements requested by the Key4hep community. The list of developments includes a new I/O service to run algorithms that read or write EDM4hep files in a multithreaded, thread-safe way, and the possibility to easily switch the EDM4hep I/O to the new ROOT RNTuple format for reading or writing. We show that both native algorithms (which use EDM4hep as input and output) and non-native algorithms from the ILC community can run together in Key4hep, building on knowledge and software developed over many years. A few examples of algorithms that have been created or ported to Key4hep recently are given, featuring the usage of Key4hep-specific features.
LUX-ZEPLIN (LZ) is a dark matter direct detection experiment using a dual-phase xenon time projection chamber with a 7-ton active volume. In 2022, the LZ collaboration published a world-leading limit on WIMP dark matter interactions with nucleons. The success of the LZ experiment hinges on the resilient design of both its hardware and software infrastructures. This talk will give an overview of the offline software infrastructure of the LZ experiment, which includes the automated movement of the data and real-time processing at NERSC, using its foremost HPC machine, Perlmutter. Additionally, I will talk about the monitoring tools and web services that enable the management and operation of LZ’s data workflow and cataloging.
A Common Tracking Software (ACTS) is an open-source, experiment-independent and framework-independent tracking software for both current and future particle and nuclear physics experiments. It provides a set of high-level detector-agnostic track reconstruction tools, which are initially developed and validated with a few example detectors, e.g. the Open Data Detector (ODD), an open-access HL-LHC-style detector for algorithmic development and benchmarking that is well integrated and supported within the ACTS toolkit. The current implementation of the ODD includes a full silicon-based tracking system and calorimetry.
So far, ACTS has already been deployed for data production at experiments such as ATLAS, sPHENIX and FASER, where the applications focus on silicon-based trackers. Recently, ACTS has been successfully extended to gaseous trackers, and the new development is being validated with a uRWell-based tracker and drift chambers for future HEP experiments such as the Super Tau Charm Facility (STCF) and the Circular Electron Positron Collider (CEPC).
In this contribution, we will present the progress we have made in implementing a drift chamber as a sub-detector of the ODD. With the newly added drift chamber, the ODD can be configured either as a full silicon tracker or as a combined tracker consisting of a pixel tracker and a drift chamber. The tracking performance of the ODD with the two detector configurations is studied using ACTS. Our experience with the ODD and ACTS will hopefully be beneficial to potential ACTS clients who are interested in evaluating the performance of ACTS algorithms.
ACTS is an experiment-independent toolkit for track reconstruction, designed from the ground up for thread safety and high performance. It is built to accommodate different experiment deployment scenarios, and also serves as a community platform for research and development of new approaches and algorithms.
A fundamental component of ACTS is the geometry library. It models a simplified representation of a detector, compared to simulation geometries. It drives the numerical track extrapolation, provides crucial inputs to track finding and fitting algorithms, and is connected to many other geometry libraries in the ecosystem, shipping with multiple plugins.
ACTS’ geometry library is historically optimized for symmetric, collider-like detectors and is most suitable for arrangements of silicon sensors. An effort has been underway for some time to rewrite large parts of the geometry code.
The goal is to be more flexible to accommodate other detector approaches and simplify the building process, while providing easy conversion to a GPU-optimized geometry for use with the detray library. Another goal is to allow for a more systematic way to write geometry plugins.
Finally, the navigation logic is delegated to detector regions, so that it can be easily extended for unconventional environments.
This contribution reports on the result of this rewrite, discusses lessons learned from the project and how they were incorporated into a robust geometry modeling solution in ACTS that will be key going forward.
To increase the automation of converting Computer-Aided Design (CAD) detector components, as well as entire detector systems, into simulatable ROOT geometries, TGeoArbN, a ROOT-compatible geometry class, was implemented, allowing the use of triangle meshes in VMC-based simulation. To improve simulation speed, a partitioning structure in the form of an octree can be utilized. TGeoArbN in combination with a CADToROOT-Converter (based on [1]) enabled, for example, a high level of automation in the conversion of the forward endcap geometry of the PANDA electromagnetic calorimeter.
The aim of the talk is to give an overview of TGeoArbN and the modified CADToROOT-Converter version.
[1] T. Stockmanns, "STEP-to-ROOT – from CAD to Monte Carlo Simulation", Journal of Physics: Conference Series 396 (2012) 022050, https://doi.org/10.1088/1742-6596/396/2/022050
Distributed Computing
The CMS computing infrastructure, spread globally over 150 WLCG sites, forms an intricate ecosystem of computing resources, software and services. In 2024, the number of production computing cores breached the half-million mark, and the storage capacity stands at 250 petabytes on disk and 1.20 exabytes on tape. To monitor these resources in real time, CMS, working closely with CERN IT, has developed a multifaceted monitoring system providing real-time insights through about 100 production dashboards.
In preparation for Run 3, the CMS monitoring infrastructure underwent significant evolution to broaden the scope of monitored applications and services while enhancing sustainability and ease of operation. Leveraging open-source solutions, provided either by the CERN IT department or managed internally, monitoring applications have transitioned from bespoke solutions to standardized data flow and visualization services. Notably, monitoring applications for distributed workload management and data handling have migrated to technologies like OpenSearch, VictoriaMetrics, InfluxDB, and HDFS, with access facilitated through programmatic APIs, Apache Spark, or Sqoop jobs, and visualization primarily via Grafana.
The majority of CMS monitoring applications are now deployed as microservices on Kubernetes clusters. This contribution presents the comprehensive stack of CMS monitoring services, showcasing how the integration of common technologies enables versatile monitoring applications and addresses the computational demands of LHC Run 3. Additionally, it explores the incorporation of analytics into the monitoring framework, demonstrating how these insights contribute to the operational efficiency and scientific output of the CMS experiment.
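As a schematic illustration of the kind of batch aggregation that feeds such dashboards (the HDFS paths, record schema and field names below are entirely hypothetical), monitoring records landed on HDFS can be summarised with Spark before visualisation in Grafana:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("monitoring-aggregation").getOrCreate()

# Hypothetical HDFS location and schema for job-monitoring records.
jobs = spark.read.json("hdfs:///project/monitoring/archive/jobs/2024/*")

# Aggregate job counts and CPU usage per site in hourly windows.
summary = (
    jobs.groupBy("site", F.window("timestamp", "1 hour"))
        .agg(F.count(F.lit(1)).alias("jobs"),
             F.sum("cpu_hours").alias("cpu_hours"))
)

summary.write.mode("overwrite").parquet("hdfs:///user/monitoring/site_hourly_summary")
```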
JAliEn, the ALICE experiment's Grid middleware, utilizes whole-node scheduling to maximize resource utilization from participating sites. This approach offers flexibility in resource allocation and partitioning, allowing for customized configurations that adapt to the evolving needs of the experiment. This scheduling model is gaining traction among Grid sites due to its initial performance benefits. Additionally, understanding common execution patterns for different workloads allows for more efficient scheduling and resource allocation strategies.
However, managing the entire set of resources on a node requires careful orchestration. JAliEn employs custom mechanisms to dynamically allocate idle resources to running workloads, ensuring overall resource usage stays within the node's capacity.
This paper evaluates the experiences of the first sites using whole-node scheduling. It highlights its suitability for accommodating jobs with varying resource demands, particularly those with high memory requirements.
Job pilots in the ALICE Grid are increasingly tasked with managing, as effectively as possible, the resources given to each job slot. With the emergence of more complex and multicore-oriented workflows, this has become an increasingly challenging process, as users often request arbitrary resources, in particular CPU and memory. This is further exacerbated by often having several user payloads running in parallel in the same slot, and by the fact that useful management utilities generally need elevated privileges to function.
To alleviate resource management within each job slot, the ALICE Grid has begun utilising novel features introduced in recent Linux kernels, such as Cgroups v2, to provide fine-grained resource controls. By allowing specific controllers to be delegated down a Cgroup hierarchy, users can access and tune these resource controls as needed, without elevated privileges. When used in conjunction with the ALICE job pilot, this enables each job slot to be subpartitioned, in turn allowing the pilot to act as its own local resource management system in its slot, with each subjob fully "boxed in" to its own subset of the given resources.
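A minimal sketch of the mechanism follows; the paths, limits and layout are illustrative only, and the real pilot logic is more involved. Within a delegated slot cgroup, the pilot can enable controllers for its children, create a sub-cgroup per payload, apply CPU and memory limits, and move the payload process into it, all without elevated privileges:

```python
from pathlib import Path

# Delegated slot cgroup (illustrative path; set up by the site's batch system).
SLOT = Path("/sys/fs/cgroup/alice_slot")

def boxed_subjob(name, pid, cpus, mem_bytes):
    """Create a sub-cgroup for one payload and confine it to a resource subset."""
    # Enable cpu and memory controllers for child cgroups of the slot.
    # Note: cgroup v2 forbids member processes in a cgroup with enabled child
    # controllers, so in practice the pilot itself must live in its own leaf.
    (SLOT / "cgroup.subtree_control").write_text("+cpu +memory")

    sub = SLOT / name
    sub.mkdir(exist_ok=True)
    # cpu.max takes "<quota> <period>" in microseconds: 2 cores -> "200000 100000".
    (sub / "cpu.max").write_text(f"{cpus * 100000} 100000")
    (sub / "memory.max").write_text(str(mem_bytes))
    # Move the payload process into its box.
    (sub / "cgroup.procs").write_text(str(pid))

# Example: confine a payload with PID 12345 to 2 cores and 4 GiB of memory.
boxed_subjob("payload_1", 12345, cpus=2, mem_bytes=4 * 1024**3)
```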
This contribution describes the updated ALICE job pilot and its management and delegation process, specifically how it utilises kernel features to create individual resource groups for its jobs while accommodating the variety of configurations and computing elements used across participating sites, enabling these features to be used across the ALICE Grid.
Unified Experiment Monitoring (UEM) is a WLCG project with the objective of harmonising the WLCG job accounting reports across the LHC experiments, in order to provide aggregated reports of the compute capacity used by WLCG over time. This accounting overview of all LHC experiments is vital for the strategy planning of WLCG and therefore has the strong support of the LHC Committee (LHCC). However, creating common overviews is challenging, due to the different internals of each experiment's monitoring system and the long time scale of the reports, which must cover at least a decade of data. These monitoring systems evolved largely independently over time, so the UEM project has to design and implement different approaches to couple the multiple data sources within the CERN IT monitoring tools that will be used. Last but not least, the different terminologies have to be aligned into a useful and coherent set. This contribution will walk the audience through the motivations of the project, the challenges faced, the design adopted to overcome them, and the current state of the art.
The risk of cyber attack against members of the research and education sector remains persistently high, with several recent high-visibility incidents including a well-reported ransomware attack against the British Library. As reported previously, we must work collaboratively to defend our community against such attacks, notably through the active use of threat intelligence shared with trusted partners both within and beyond our sector.
We discuss the development of capabilities to defend sites across the WLCG and other research and education infrastructures, with a particular focus on sites other than Tier 1s, which may have fewer resources available to implement full-scale security operations processes. These capabilities include the pDNSSOC software, which provides a lightweight and flexible means to correlate DNS logs with threat intelligence, and an examination of the use of Endpoint Detection and Response (EDR) tools in a high-throughput context.
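The core idea of correlating DNS logs with shared threat intelligence can be sketched in a few lines; this is a toy illustration rather than the pDNSSOC implementation, and the file formats and field positions are assumptions:

```python
import csv

# Load a set of known-malicious domains, e.g. exported from a threat-intel feed.
with open("indicators.txt") as f:
    bad_domains = {line.strip().lower() for line in f if line.strip()}

# Scan resolver logs (assumed columns: timestamp, client_ip, queried_name, ...).
with open("dns_queries.csv") as f:
    for ts, client_ip, qname, *rest in csv.reader(f):
        if qname.rstrip(".").lower() in bad_domains:
            print(f"ALERT {ts}: {client_ip} resolved {qname}")
```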
This report will include an important addition to the work of the Security Operations Centre Working Group; while this group had previously focused primarily on the technology stacks appropriate for use in deploying fine-grained security monitoring services, the people and processes involved with such capabilities are equally important.
Defending as a community requires a strategy that brings people, processes and technology together. We suggest approaches to support organisations and their computing facilities to defend against a wide range of threat actors. While a robust technology stack plays a significant role, it must be guided and managed by processes that make their cybersecurity strategy fit their environment.
GlideinWMS was one of the first middleware services in the WLCG community to transition from X.509-only authentication to also supporting tokens. The first step was to go from the prototype in 2019 to using tokens in production in 2022. This paper will present the challenges introduced by the wider adoption of tokens and the evolution plans for securing the pilot infrastructure of GlideinWMS and supporting the new requirements.
In the last couple of years, the GlideinWMS team supported the migration to tokens of experiments and resources. Inadequate support in the current infrastructure, more stringent requirements, and the higher spatial and temporal granularity forced GlideinWMS to revisit once more how credentials are generated, used, and propagated.
The new credential modules have been designed to be used in multiple systems (GWMS, HC) and use a model where credentials have type, purpose, and different flows.
Credentials are dynamically generated in order to customize their duration and limit their scope to the targeted resource. This makes it possible to enforce the principle of least privilege. Finally, we also considered adding credential storage, renewal, and invalidation mechanisms within the GlideinWMS infrastructure to better serve the experiments' needs.
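As an illustration of generating a short-lived, narrowly scoped credential for a specific target (the token endpoint, client registration and scope names below are placeholders, not GlideinWMS internals), an OAuth2 client-credentials request might look like this:

```python
import requests

# Hypothetical issuer and client registration.
TOKEN_ENDPOINT = "https://token-issuer.example.org/token"
CLIENT_ID = "pilot-factory-client"
CLIENT_SECRET = "..."

response = requests.post(
    TOKEN_ENDPOINT,
    data={
        "grant_type": "client_credentials",
        # Request only what the targeted resource needs, for a limited lifetime.
        "scope": "compute.create storage.read:/experiment/data",
        "audience": "https://ce.target-site.example.org",
    },
    auth=(CLIENT_ID, CLIENT_SECRET),
)
response.raise_for_status()
access_token = response.json()["access_token"]
```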
Simulation and analysis tools
Non-perturbative QED is used to predict beam backgrounds at the interaction point of colliders, in calculations of Schwinger pair creation, and in precision QED tests with ultra-intense lasers. In order to predict these phenomena, custom-built Monte Carlo event generators based on a suitable non-perturbative theory have to be developed. One such suitable theory uses the Furry Interaction Picture, in which a background field is taken into account non-perturbatively at the Lagrangian level. This theory is precise, but the transition probabilities are, in general, complicated. This poses a challenge for the Monte Carlo, which struggles to implement the theory computationally. The Monte Carlo must in addition take into account the behaviour of the background field at every space-time point at which an event is generated. We introduce here just such a Monte Carlo package, called IPstrong, and the techniques implemented to deal with the specific challenges outlined above.
The effort to speed up the Madgraph5_aMC@NLO generator by exploiting CPU vectorization and GPUs, which started at the beginning of 2020, is expected to deliver the first production release of the code for QCD leading-order (LO) processes in 2024. To achieve this goal, many additional tests, fixes and improvements have been carried out by the development team in recent months, both to carry out its internal workplan and to respond to the feedback from the LHC experiments about the current and required functionalities of the software. Several new physics processes, including both Standard Model and Beyond Standard Model calculations, have been tested and extensively debugged. Support for AMD GPUs via native HIP has been added to the CUDA/C++ baseline implementation of the code; work is in progress to also add support for Intel GPUs to this CUDA/C++ plugin, based on the parallel SYCL implementation developed in the past. The user interface and packaging of the software, and the usability challenges coming from the large number of events that must be generated in parallel on a GPU, have also been an active area of development. In this contribution, we will report on these activities and on the status of the LO software at the time of the CHEP2024 conference. The status and outlook for one of the main further directions of our development effort, notably the support of next-to-leading-order (NLO) processes, is described in a separate contribution to this conference.
As the quality of experimental measurements increases, so does the need for Monte Carlo-generated simulated events, both in total amount and in precision. In perturbative methods this involves the evaluation of higher-order corrections to the leading-order (LO) scattering amplitudes, including real emissions and loop corrections. Although experimental uncertainties today are larger than those of simulations, at the High-Luminosity LHC the experimental precision is expected to exceed the theoretical one for events generated at less than next-to-leading-order (NLO) precision. As forecasted hardware resources do not meet the CPU requirements for these simulation needs, speeding up NLO event generation is a necessity for particle physics research.
In recent years, collaborators across Europe and the United States have been working, with great success, on CPU vectorisation of LO event generation within the MadGraph5_aMC@NLO framework, as well as on porting it to GPUs. Recently, development has also started on vectorising NLO event generation. Due to the more complicated nature of NLO amplitudes, this effort faces several difficulties not encountered in the LO development. Nevertheless, it looks promising, and a status report as well as the latest results will be presented in this contribution.
Quantum computers may revolutionize event generation for collider physics by allowing calculation of scattering amplitudes from full quantum simulation of field theories. Although rapid progress is being made in understanding how best to encode quantum fields onto the states of quantum registers, most formulations are lattice-based and would require an impractically large number of qubits when applied to scattering events at colliders with a wide momentum dynamic range. In this regard, the single-particle digitization approach of Barata et al. (Phys. Rev. A 103) is highly attractive for its qubit efficiency and strong association with scattering. Since the original work established the digitization scheme on the scalar phi4 theory, we explore its extensions to fermion fields and other types of interactions. We then implement small-scale scattering simulations on both real quantum computers and a statevector calculator run on HPCs. A possible roadmap toward realizing the ultimate goal of performing collider event generation from quantum computers will be discussed.
The generation of large event samples with Monte Carlo Event Generators is expected to be a computational bottleneck for precision phenomenology at the HL-LHC and beyond. This is due in part to the computational cost incurred by negative weights in 'matched' calculations combining NLO perturbative QCD with a parton shower: for the same target uncertainty, a larger sample must be generated.
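To make the cost concrete: for weights of equal magnitude and a negative-weight fraction $f$, the effective sample size is reduced by a factor $(1-2f)^{2}$, so the number of generated events needed to reach a given statistical uncertainty grows as $N \to N/(1-2f)^{2}$; a 25% negative fraction already requires four times as many events. (This is the standard back-of-the-envelope estimate, quoted here to quantify the statement above.)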
We summarise two approaches taken to tackle this problem in Herwig: the development of the KrkNLO matching method, which uses a redefinition of the PDF factorisation scheme to guarantee positive weights by construction, and the restructuring of the Matchbox module to reduce the fraction of negative weights for MC@NLO matching.
The generation of Monte Carlo events is a crucial step for all particle collider experiments. Accurately simulating the hard scattering processes is the foundation for subsequent steps, such as QCD parton showering, hadronization, and detector simulations. A major challenge in event generation is the efficient sampling of the phase spaces of hard scattering processes due to the potentially large number and complexity of Feynman diagrams and their interference and divergence structures.
In this presentation, we address the challenges of efficient Monte Carlo event generation and demonstrate improvements that can be achieved through the application of advanced sampling techniques. We highlight that using the algorithms implemented in BAT.jl for sampling the phase spaces given by Sherpa offers great flexibility in the choice of sampling algorithms and has the potential to significantly enhance the efficiency of event generation.
By interfacing BAT.jl, a package designed for Bayesian analyses that offers a collection of modern sampling algorithms, with the Sherpa event generator, we aim to improve the efficiency of phase space exploration and Monte Carlo event generation. We combine the physics-informed multi-channel sampling approach of Sherpa with advanced sampling techniques such as Markov Chain Monte Carlo (MCMC) and Nested Sampling. Additionally, we investigate the potential of novel machine learning-enhanced sampling methods to optimize phase space mappings and accelerate the event generation process. The current prototype interface between Sherpa and BAT.jl features a modular design that offers full flexibility in selecting target processes and provides detailed control over the sampling algorithms. It also allows for a simple integration of innovative sampling techniques such as normalizing flow-enhanced MCMC.
Simulation and analysis tools
Within the ROOT/TMVA project, we have developed a tool called SOFIE that takes externally trained deep learning models in ONNX format, or in Keras and PyTorch native formats, and generates C++ code that can be easily included and invoked for fast inference of the model. The generated code has minimal dependencies and can be easily integrated into the data processing and analysis workflows of the HEP experiments.
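A minimal usage sketch via PyROOT, following the pattern of the public TMVA SOFIE tutorials (the model file name is a placeholder, and the exact API should be checked against the installed ROOT version):

```python
import ROOT

# Parse an externally trained ONNX model into SOFIE's intermediate representation.
parser = ROOT.TMVA.Experimental.SOFIE.RModelParser_ONNX()
model = parser.Parse("my_network.onnx")   # placeholder file name

# Emit standalone C++ inference code with minimal dependencies.
model.Generate()
model.OutputGenerated("my_network.hxx")
# The generated header can then be compiled into an experiment's C++ analysis
# code and invoked directly for fast inference.
```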
This study presents a comprehensive benchmark analysis of SOFIE and prominent machine learning frameworks for model inference, such as PyTorch, TensorFlow XLA and ONNXRuntime. Our research focuses on evaluating the performance of these tools in the context of HEP, with an emphasis on their application to typical models, such as Graph Neural Networks for jet tagging and variational auto-encoders and GANs for fast simulation. We assess the tools based on several key parameters, including computational speed, memory usage, scalability, and ease of integration with existing HEP software ecosystems. Through this comparative study, we aim to provide insights that can guide the HEP community in selecting the most suitable framework for their specific needs.
The HIBEAM-NNBAR experiment at the European Spallation Source is a multidisciplinary two-stage program of experiments that includes high-sensitivity searches for neutron oscillations, searches for sterile neutrons, searches for axions, as well as the search for exotic decays of the neutron. The computing framework of the collaboration includes diverse software, from particle generators to Monte Carlo transport codes, which are uniquely interfaced together. Significant advances have been made in computing and simulation for HIBEAM-NNBAR, particularly with machine learning applications and with the introduction of fast parametric simulations in Geant4. A summary of the simulation steps of the experiment, including beamline, cosmic veto system, as well as detector simulations and estimation of the background processes, will be presented.
The ROOT software framework is widely used in HENP for the storage, processing, analysis and visualization of large datasets. With the large increase in the use of ML in experiment workflows, especially in the final steps of the analysis pipeline, the matter of exposing ROOT data ergonomically to ML models becomes ever more pressing. This contribution presents the advancements in an experimental component of ROOT that exposes datasets in batches ready for the training phase. This fe