
HEPiX Spring 2025 Workshop

Europe/Zurich
Hotel De La Paix

Hotel De La Paix

Via Giuseppe Cattori 18 6900 Lugano Switzerland
Ofer Rind (Brookhaven National Laboratory), Jose Flix Molina (CIEMAT - Centro de Investigaciones Energéticas Medioambientales y Tec. (ES)), Tomoaki Nakamura, Dino Conciatore (CSCS (Swiss National Supercomputing Centre))
Description

HEPiX Spring 2025 at the Swiss National Supercomputing Centre in Lugano

Please take note of the following important deadlines:

  • Regular registration and abstract submissions will close on March 16, 2025, at 23:59 (CET)
  • We regret to inform you that all seats for the Sponsored Social Dinner have been filled. The restaurant has reached its capacity of 110 seats.

 

The HEPiX forum brings together worldwide information technology staff, including system administrators, system engineers, and managers from High Energy Physics and Nuclear Physics laboratories and institutes, to foster a learning and sharing experience between sites facing scientific computing and data challenges.

Participating sites include BNL, CERN, DESY, FNAL, IHEP, IN2P3, INFN, IRFU, JLAB, KEK, LBNL, NDGF, NIKHEF, PIC, RAL, SLAC, TRIUMF, and many other research labs and universities from all over the world.

More information about the HEPiX workshops, the working groups (who report regularly at the workshops) and other events is available on the HEPiX Web site.

This workshop will be hosted by CSCS, the Swiss National Supercomputing Centre, and will be held at the Hotel De La Paix in Lugano.

 

SPONSORS

GOLD

ACADEMIC COLLABORATIONS

BRONZE
    • 08:00 09:00
      Registration 1h
    • 09:00 09:30
      Welcome
      Convener: Jose Flix Molina (CIEMAT - Centro de Investigaciones Energéticas Medioambientales y Tec. (ES))
      • 09:00
        Welcome talk 15m
        Speaker: Pablo Fernandez
      • 09:15
        Logistics talk 15m
        Speaker: Mr Dino Conciatore (CSCS (Swiss National Supercomputing Centre))
    • 09:30 10:30
      Site Reports
      Convener: Andreas Petzold (KIT - Karlsruhe Institute of Technology (DE))
      • 09:30
        CSCS site report 15m

        CSCS will present its latest updates as a Tier-2 site.

        Speaker: Miguel Gila (ETH Zurich)
      • 09:45
        KEK Site Report 15m

        KEK is promoting various accelerator science projects by fully utilizing the electron accelerator in Tsukuba and the proton accelerator in Tokai. These projects require a large amount of data processing, and our central computing system, KEKCC, plays a key role in their success. KEKCC also works as part of the Grid system, which is essential to the Belle II project.
        We will report on the operational status of KEKCC and our campus network systems, as well as related activities such as security operations.

        Speaker: Ryo Yonamine (KEK)
      • 10:00
        NCG-INGRID-PT site report 15m

        Evolution of the NCG-INGRID-PT site and future perspectives.

        Speaker: Jorge Gomes (LIP)
      • 10:15
        NDGF Site Report 15m

        New developments in the distributed Nordic Tier-1 and its participant sites.

        Speaker: Mattias Wadenstein (University of Umeå (SE))
    • 10:30 11:00
      Coffee Break 30m
    • 11:00 12:15
      Site Reports
      Convener: Dr Sebastien Gadrat (CCIN2P3 - Centre de Calcul (FR))
      • 11:00
        DESY site report 15m

        DESY site report

        Speaker: Yves Kemp
      • 11:15
        BNL Site Report 15m

        An update on recent developments at the Scientific Computing and Data Facilities (SCDF) at BNL.

        Speaker: Ofer Rind (Brookhaven National Laboratory)
      • 11:30
        US ATLAS SouthWest Tier2 Site Report 15m

        An update on recent advancements at the US ATLAS SouthWest Tier2 Center (UTA/OU).

        Speaker: Horst Severini (University of Oklahoma (US))
      • 11:45
        Introduction of LHCb Tier2 Site at Lanzhou University 15m

        This report introduces the LHCb Tier-2 site at Lanzhou University (LZU-T2), a major new computing resource designed to support the LHCb experiment. It is part of the Worldwide LHC Computing Grid, which distributes data processing and storage across a network of international computing centers. The LZU-T2 site plays a critical role in processing, analyzing, and storing the vast amounts of data produced by the experiment. Its establishment helps balance the computing load regionally, complementing other Chinese sites such as the Beijing Tier-1 facility. The LZU-T2 site not only supports LHCb’s data challenges but also helps integrate regional computing resources.

        Speaker: Dong Xiao (Lanzhou University)
      • 12:00
        CERN site report 15m

        News from CERN since the last HEPiX workshop. This talk gives a general update from services in the CERN IT department.

        Speaker: Elvin Alin Sindrilaru (CERN)
    • 12:15 13:30
      Lunch 1h 15m
    • 13:30 15:30
      Storage & data management
      Convener: Dr Andrew Pickford (Nikhef)
      • 13:30
        EOS latest developments and operational experience 20m

        EOS is an open-source storage system developed at CERN that is used as the main platform to store LHC data. The architecture of the EOS system has evolved over the years to accommodate ever more diverse use cases and performance requirements coming both from the LHC experiments and from the user community running their analysis workflows on top of EOS. In this presentation, we discuss the performance of the EOS service during the 2025 run and outline the latest developments targeting diverse areas such as improvements to file-system consistency checks, the deployment of a high-availability setup for the metadata service, namespace locking and performance optimizations, as well as other topics. Apart from new developments, we also discuss changes in the deployment model, especially the move to a native HTTP approach and the commissioning of the gRPC interface used by the CERNBox service. To conclude, we outline some of the key achievements from 2025 that make us confident our system is fully prepared for the challenges that lie ahead as we approach the end of Run 3 and the preparation for the High-Luminosity LHC.

        Speaker: Elvin Alin Sindrilaru (CERN)
      • 13:50
        Design and production experience with a multi-petabyte file-system backup service at CERN 20m

        We report on our experience with the production backup orchestration via “cback”, a tool developed at CERN and used to back up our primary mounted filesystem offerings: EOS (eosxd) and Ceph (CephFS). In a storage system that handles non-reproducible data, a robust backup and restore system is essential for effective disaster recovery and business continuity. When designing a backup solution, it is crucial to consider the same factors that apply to the production service itself: scalability, performance, security, and operational costs. In this contribution, we will discuss the challenges we encountered, the decisions we made, and the innovative strategies we implemented while designing cback. Many of these insights can be applied to other backup strategies as well.

        Speaker: Roberto Valverde Cameselle (CERN)
      • 14:10
        Refurbishing the Meyrin Data Centre: Storage Juggling and Operations 20m

        The 50-year-old Meyrin Data Centre (MDC) remains indispensable due to its strategic geographical location and unique electrical power resilience, even though CERN IT recently commissioned the Prévessin Data Centre (PDC), doubling the organization’s hosting capacity in terms of electricity and cooling. The Meyrin Data Centre (Building 513) retains an essential role for the CERN Tier-0 Run 4 commitments, notably as the primary hosting location for the tape archive and the disk storage. The inevitable investments in the infrastructure (UPS and cooling) are now triggering the refurbishment of the two main rooms where all the storage equipment is hosted. This presentation will delve into the architectural advancements and operational strategies implemented for and during the Meyrin Data Centre refurbishment. We will explore how these developments will impact our storage and how the storage operations team will ensure EOS’s performance, scalability, and reliability in the coming years.

        Speaker: Octavian-Mihai Matei
      • 14:30
        A Distributed Storage Odyssey: from CentOS7 to ALMA9 20m

        On the 30th of June 2024, the end of CentOS 7 support marked a new era for the operation of the multi-petabyte distributed disk storage system used by the CERN physics experiments. The EOS infrastructure at CERN is composed of approximately 1000 disk servers and 50 metadata management nodes. Their transition from CentOS 7 to Alma 9 was not as straightforward as anticipated.

        This presentation explains that transition. From the change of supported certificate and Kerberos key signature lengths and algorithms, to OpenSSL library hiccups and Linux kernel crashes, the EOS operations team had to take on different challenges to ensure a seamless operating-system transition of the infrastructure while maintaining uninterrupted data transfers for the CERN experiments.

        Speaker: Cedric Caffy (CERN)
      • 14:50
        Label-based Virtual Directories In dCache 20m

        Traditional filesystems organize data into directories based on a single criterion, such as the starting date of the experiment, experiment name, beamline ID, measurement device, or instrument. However, each file within a directory can belong to multiple logical groups, such as a special event type, experiment condition, or part of a selected dataset. dCache, a storage system designed to handle large volumes of scientific data, is widely used in High Energy Physics (HEP) and Photon Science experiments. Recent advancements in dCache have introduced the concept of file tagging, which dynamically groups files with the same label into virtual directories. These file labels can be added, removed, renamed, and deleted through an admin interface or via a REST API. The files in these virtual directories are accessible through all protocols supported by dCache.
        This presentation will delve into the implementation details of file tagging in dCache and outline our future development plans, including automatic metadata extraction. This feature aims to significantly simplify data management. Furthermore, we are exploring the use of virtual directories to translate scientific data catalogs into filesystem views, enabling direct data analysis. We will also discuss our new developments in the context of the National Analysis Facility (NAF) at DESY.
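
        As a rough illustration of the label mechanism described above, the sketch below attaches a label to a file through the dCache frontend REST API from Python. The frontend URL, credentials, file path and the exact "set-label" payload are assumptions for illustration only; the REST API documentation of your dCache release has the authoritative label operations.

        # Hedged sketch: attach a label to a file via the dCache frontend REST API.
        # Endpoint path, port and the "set-label" action are assumptions for illustration.
        import requests

        FRONTEND = "https://dcache.example.org:3880/api/v1"   # placeholder frontend URL
        AUTH = ("user", "password")                           # or a macaroon / bearer token

        # Attach the label "golden-dataset" to one file in the namespace (assumed payload).
        r = requests.post(
            f"{FRONTEND}/namespace/data/exp/run001/file.root",
            json={"action": "set-label", "label": "golden-dataset"},
            auth=AUTH,
            timeout=10,
        )
        r.raise_for_status()

        # All files carrying this label now appear together in a virtual directory,
        # browsable through any protocol supported by dCache (NFS, WebDAV, xroot, ...).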

        Speaker: Marina Sahakyan
      • 15:10
        NVMe-HDD Solution-Level Usage Models, Features and Advantages 20m

        The NVMe HDD specification was released back in 2022, but so far only very early engineering demo units have been created, from a single source. That said, market demand is definitely growing, and the industry must pay attention to the potential TCO and storage-stack optimizations that a unified NVMe storage interface could offer. In this session, we will go over the TCO analysis details as well as the stack-simplification advantage of NVMe in cloud and AI applications. We also go over the various NVMe features that need to be implemented for HDDs, and the various NVMe TPARs that would be needed. The call to action is to revive the NVMe-HDD workstream, review these proposed features, and collaborate on asking for the required TPARs to be created within the NVMe committee to support this effort.

        Speakers: Hugo Bergmann (Seagate Technology), Mohamad El-Batal
    • 15:30 16:00
      Coffee Break 30m
    • 16:00 17:25
      Environmental sustainability, business continuity, and Facility improvement
      Convener: Peter Gronbech (University of Oxford (GB))
      • 16:00
        Updates on CPUs, GPUs and AI accelerators 20m

        In this presentation we give an update on the CPUs, GPUs and AI accelerators on the market today.

        Speaker: Dr Michele Michelotto (Universita e INFN, Padova (IT))
      • 16:40
        Nikhef is renovated. So what to do with our new meeting rooms technology-wise? 20m

        Nikhef has recently renovated its building and upgraded almost everything to the latest standards, including the audio/video setup in the new meeting rooms.

        This talk will give insight into the process, from choosing technologies and tendering to installation, testing and getting everything working: what went wrong and what did not, why you would think that 4K 60 Hz is easy these days, and why a PlayStation 5 is very important to have for this project.
        Building a setup that is easy, flexible and robust to use for a diverse set of users.

        Speaker: Tristan Suerink
    • 18:00 23:00
      Welcome Reception 5h Hotel Splendide Royal

      Riva Antonio Caccia, 7 - 6900 Lugano, Switzerland

      apero + dinner

    • 09:00 09:30
      Science talk
      Convener: Jose Flix Molina (CIEMAT - Centro de Investigaciones Energéticas Medioambientales y Tec. (ES))
      • 09:00
        Cherenkov Telescope Array Observatory 30m

        The Cherenkov Telescope Array Observatory (CTAO) is a next-generation ground-based gamma-ray astronomical observatory that is under construction on two sites, one in the Northern and one in the Southern hemisphere. CTAO telescopes use the atmosphere as a giant detector of high-energy particles. CTAO data contain "events" of extensive air showers of high-energy particles. Most of the showers are induced by charged cosmic rays, but a small fraction are induced by gamma rays coming from astronomical sources that act as powerful particle accelerators. CTAO will generate about ten petabytes of event data per year, complemented by a comparable amount of Monte Carlo simulated data. These data will be managed by four Off-Site Data Centres distributed across Europe and coordinated by the Science Data Management Centre. One of the four Off-Site Data Centres will be hosted by CSCS in Switzerland. The Off-Site Data Centres will perform reduction of the large volume of real and Monte Carlo data ("Data Level 0") to extract information on gamma-ray-like events that carry information on the astronomical sources. Such reduced data (Data Level 3) will be distributed to astronomers across the world and ultimately made publicly available.

        Speaker: Andrii Neronov (EPFL and APC Paris)
    • 09:30 10:15
      Site Reports
      Convener: Andreas Petzold (KIT - Karlsruhe Institute of Technology (DE))
      • 09:30
        IHEP site report 15m

        The progress and status of the IHEP site since the last HEPiX workshop.

        Speaker: Chaoqi Guo (Institute of High Energy Physics of the Chinese Academy of Sciences)
      • 09:45
        RAL Site Report 15m

        An update on activities at RAL

        Speaker: Martin Bly (STFC-RAL)
      • 10:00
        CTAO Swiss Data Center and Data Processing and Preservation System Deployment Strategy 15m

        The Cherenkov Telescope Array Observatory (CTAO) is the next-generation gamma-ray telescope facility, currently under construction.

        The CTAO recently reached a set of crucial milestones: it has been established as a European Research Infrastructure Consortium (ERIC), all four Large-Sized Telescopes at the northern site of the Observatory reached key construction milestones, and the first version of the Data Processing and Preservation System (DPPS) has been released.

        I will present a brief overview of the implementation of the Swiss CTAO Data Center, and the current status of the CTAO DPPS deployment.

        I will explain how CTAO's adoption of WLCG technologies (including Rucio, DIRAC and INDIGO IAM) enables synergies with many HEP projects.

        Speaker: Dr Volodymyr Savchenko (EPFL, Switzerland)
    • 10:15 10:30
      Software and Services for Operation
      Convener: Dennis van Dok (Nikhef)
      • 10:15
        Infrastructure Monitoring for GridKa and beyond 15m

        Infrastructure monitoring helps to control and monitor, in real time, the servers and applications involved in the operation of the WLCG Tier-1 center GridKa, including the online and tape storage, the batch system and the GridKa network.
        Monitoring data such as server metrics (CPU, memory, disk, network), storage operations (I/O statistics) and real-time sensor data such as temperature, humidity and power consumption in the server rooms is very important to provide a complete picture of the availability, performance and resource efficiency of the entire data center.

        Through the integration of open-source and widely known technologies we have built a scalable solution able to collect, store and visualize infrastructure data across the data center. In this presentation we will talk about the main components of our monitoring architecture and the technologies we use: Telegraf as the agent to collect metrics, InfluxDB as the time-series database to store data, and Grafana as a powerful visualization tool to query and visualize data. In addition, we operate a five-node cluster based on the OpenSearch search engine to collect logs from many sources.
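
        The abstract above describes a Telegraf-to-InfluxDB-to-Grafana pipeline. As a minimal sketch of the underlying pattern (collect host metrics, write them to InfluxDB over HTTP in line protocol, let Grafana query them later), here is an illustrative Python loop; the URL, organisation, bucket, token and host name are placeholders rather than GridKa's configuration, and in production Telegraf performs this role.

        # Illustrative collect-and-store loop; in production Telegraf does this job.
        # URL, org, bucket, token and host name are placeholders.
        import time

        import psutil     # host metrics: CPU, memory, disk
        import requests

        INFLUX_WRITE = ("https://influxdb.example.org/api/v2/write"
                        "?org=monitoring&bucket=hosts&precision=s")
        TOKEN = "REPLACE_ME"

        def sample() -> str:
            """Return one InfluxDB line-protocol record with basic host metrics."""
            cpu = psutil.cpu_percent(interval=1)
            mem = psutil.virtual_memory().percent
            disk = psutil.disk_usage("/").percent
            return f"host_metrics,host=worker01 cpu={cpu},mem={mem},disk={disk} {int(time.time())}"

        while True:
            requests.post(INFLUX_WRITE,
                          headers={"Authorization": f"Token {TOKEN}"},
                          data=sample(), timeout=10)
            time.sleep(60)    # one sample per minute; Grafana dashboards query InfluxDB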

        Speaker: Evelina Buttitta (Karlsruhe Institute of Technology (KIT))
    • 10:30 11:00
      Coffee Break 30m
    • 11:00 12:15
      Environmental sustainability, business continuity, and Facility improvement
      Convener: Peter Gronbech (University of Oxford (GB))
      • 11:00
        CSCS Sustainability: Utilizing Lake Water for Cooling and Reusing Waste Heat 20m

        The Swiss National Supercomputing Centre (CSCS) is committed to sustainable high-performance computing. This talk will explore how CSCS leverages lake water for efficient cooling, significantly reducing energy consumption. Additionally, we will discuss the reuse of waste heat to support local infrastructure, demonstrating a practical and efficient approach to sustainability in supercomputing.

        Speaker: Mr Tiziano Belotti (CSCS)
      • 11:20
        Natural job drainage and power reduction studies in PIC Tier-1 using HTCondor 20m

        This study presents analyses of natural job drainage and power-reduction patterns in the PIC Tier-1 data center, which uses HTCondor for workload scheduling. By examining historical HTCondor logs from 2023 and 2024, we simulate natural job drainage behavior, i.e. how resources free up when jobs conclude without external intervention. These findings provide insights into the center's capability to modulate resource usage according to external factors such as green-energy availability cycles.

        To further validate and extend these observations, simulations were conducted under various load conditions to evaluate the influence of job types and VO-specific durations on drainage cycles. An analysis of power consumption pre- and post-drainage, facilitated by ipmitool, allows for estimating potential power and carbon emission reductions in drainage scenarios. Building on these insights, machine learning models are being developed to predict optimal power scaling adjustments.

        We propose a conceptual feedback loop to HTCondor that could enable real-time power adjustments based on fluctuations in green energy availability. By exploring these ideas, this research aims to contribute to a more sustainable data center model, offering a framework for adapting workload management to dynamic environmental factors.
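
        As a small illustration of the power-measurement step mentioned above (not PIC's actual tooling), the following Python sketch reads the instantaneous node power with ipmitool before and after draining and extrapolates the saving to a hypothetical rack; the node count and the draining workflow are assumptions.

        # Illustrative sketch: estimate the power saved by draining worker nodes,
        # using "ipmitool dcmi power reading". Rack size and workflow are hypothetical.
        import re
        import subprocess

        def power_watts() -> float:
            """Parse the instantaneous power reading reported by ipmitool."""
            out = subprocess.run(["ipmitool", "dcmi", "power", "reading"],
                                 capture_output=True, text=True, check=True).stdout
            return float(re.search(r"Instantaneous power reading:\s+(\d+)\s+Watts", out).group(1))

        loaded = power_watts()    # measure while the node is full of jobs
        # ... drain the node (e.g. condor_drain) and wait until it is empty ...
        drained = power_watts()   # measure again on the idle node

        nodes_in_rack = 40        # hypothetical rack size
        saving_kw = (loaded - drained) * nodes_in_rack / 1000.0
        print(f"Estimated saving when draining the whole rack: {saving_kw:.1f} kW")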

        Speaker: Jose Flix Molina (CIEMAT - Centro de Investigaciones Energéticas Medioambientales y Tec. (ES))
      • 11:40
        Transforming the Disaster Recovery of the Cloud Service 20m

        The CERN Cloud Infrastructure Service provides access to large compute and storage resources for the laboratory, including virtual and physical machines, volumes, file shares, load balancers, etc., across two different datacenters. With the recent addition of the Prevessin Data Center, one of the main objectives of the CERN IT Department is to ensure that all services have up-to-date procedures for disaster recovery, including the CERN Private Cloud Service.

        To implement the BC/DR policy, the Cloud Team has not only executed the recovery test but also built automation on top of it. This talk will dive into the usage of tools like Terraform to ensure the redeployment and recovery of the control plane in case of a major outage.

        Speaker: Varsha Bhat
      • 12:00
        Update on Energy Efficiency: AmpereOne and Turin 15m

        Extending the data presented at the last few HEPiX workshops, we present new measurements on the energy efficiency (HEPScore/Watt) of the recently available AmpereOne-ARM and AMD Turin-x86 machines.

        Speaker: David Britton (University of Glasgow (GB))
    • 12:15 13:30
      Lunch 1h 15m
    • 13:30 15:20
      Mid-long term evolution of facilities (Topical Session with WLCG OTF)
      Conveners: Alessandro Di Girolamo (CERN), Helge Meinhard (CERN)
      • 13:30
        Introduction to HEPIX-OTF Topical Session 5m
        Speakers: Alessandro Di Girolamo (CERN), Helge Meinhard (CERN), James Letts (Univ. of California San Diego (US))
      • 13:35
        Storage challenges (DOMA + capacity vs performance) 20m
        Speaker: Shawn Mc Kee (University of Michigan (US))
      • 13:55
        HEPiX Technology Watch Working Group Report 25m

        The Technology Watch Working Group, established in 2018 to take a close look at the evolution of the technology relevant to HEP computing, has resumed its activities after a long pause. In this report, we provide an overview of the hardware technology landscape and some recent developments, highlighting the impact on the HEP computing community.

        Speaker: Dr Andrea Sciabà (CERN)
      • 14:20
        Italy vision 30m
        Speaker: Daniele Spiga (Universita e INFN, Perugia (IT))
      • 14:50
        IDAF @ DESY: Interdisciplinary Data and Analysis Facility: Status and Plans 30m

        DESY operates the IDAF (Interdisciplinary Data and Analysis Facility) for all science branches: high energy physics, photon science, and accelerator R&D and operations.
        The NAF (National Analysis Facility) is an integrated part and has acted as an analysis facility for the German ATLAS and CMS communities as well as the global Belle II community since 2007.
        This presentation will show the current status and further plans of the implementation, driven by use cases of the different user communities.

        Speaker: Yves Kemp (Deutsches Elektronen-Synchrotron (DE))
    • 15:20 15:50
      Coffee Break 30m
    • 15:50 17:20
      Mid-long term evolution of facilities (Topical Session with WLCG OTF)
      Conveners: Alessandro Di Girolamo (CERN), Helge Meinhard (CERN)
      • 15:50
        German University Tier-2s evolution 30m

        Transition of German University Tier-2 Resources to HPC Compute and Helmholtz Storage

        The March 2022 perspective paper of the German Committee for Elementary Particle Physics proposes a transformation of the provision of computing resources in Germany. In preparation for the HL-LHC, the German university Tier-2 centres are to undergo a transition towards a more resource-efficient and environmentally friendly provision of computing and disk storage. To this end, the mass storage facilities of the university Tier-2 centres are to be gradually replaced by the Helmholtz centres DESY and KIT, and the computing capacities by computing time at the National High Performance Computing centres of the NHR Alliance (NHR).
        This contribution will summarise the transformation, the technical implementation and initial experiences.

        Speaker: Michael Boehler (University of Freiburg (DE))
      • 16:20
        Evolution of US ATLAS Sites 30m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
      • 16:50
        "Round table" - Facilities in WLCG Technical Roadmap 30m
    • 09:00 09:30
      Science talk
      Convener: Jose Flix Molina (CIEMAT - Centro de Investigaciones Energéticas Medioambientales y Tec. (ES))
      • 09:00
        The Big Data Challenge of Radio Astronomy 30m

        Radio astronomers are engaged in an ambitious new project to detect faster, fainter, and more distant astrophysical phenomena using thousands of individual radio receivers linked through interferometry. The expected deluge of data (up to 300 PB per year) poses a significant computational challenge that requires rethinking and redesigning the state-of-the-art data analysis pipelines.

        Speaker: Dr Emma Elizabeth Tolley
    • 09:30 10:30
      Cloud Technologies, Virtualization & Orchestration, Operating Systems
      Convener: Mr Dino Conciatore (CSCS (Swiss National Supercomputing Centre))
      • 09:30
        Kubernetes and Cloud Native at the SKA Regional Centres 20m

        The SKA Observatory is expected to produce up to 600 petabytes of scientific data per year, which would set a new record in data generation within the field of observational astronomy. The SRCNet infrastructure is meant to handle these large volumes of astronomy data, which requires a global network of distributed regional centres for the data- and compute-intensive astronomy use cases. On the Swiss SRCNode, we aim to use Kubernetes as a service-management plane which interacts with external storage and compute services as part of SRCNet to build a science analysis platform.

        Speaker: Lukas Gehrig (FHNW)
      • 09:50
        Cloud-native ATLAS T2 on Kubernetes 20m

        The University of Victoria operates a scientific OpenStack cloud for Canadian researchers, and the CA-VICTORIA-WESTGRID-T2 grid site for the ATLAS experiment at CERN. We are shifting both of these service offerings towards a Kubernetes-based approach. We have exploited the batch capabilities of Kubernetes to run grid computing jobs and replace the conventional grid computing elements by interfacing with the Harvester workload management system of the ATLAS experiment. We have also adapted and migrated the APEL accounting service and Squid caching proxies to cloud-native deployments on Kubernetes, and are preparing a Kubernetes-based EOS storage element. We aim to enable fully comprehensive deployment of a complete ATLAS Tier 2 site on a Kubernetes cluster via Helm charts. Moreover, we are now preparing to deploy Openstack itself on a bare metal Kubernetes cluster.
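
        As a minimal sketch of the "grid job as a Kubernetes batch Job" idea described above, the following Python snippet uses the official Kubernetes client to submit a single Job; the image, namespace, command and resource requests are placeholders rather than the actual Harvester/ATLAS payload.

        # Minimal sketch: run a (placeholder) grid payload as a Kubernetes batch Job.
        from kubernetes import client, config

        config.load_kube_config()   # or config.load_incluster_config() inside the cluster

        job = client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(name="atlas-pilot-0001"),
            spec=client.V1JobSpec(
                backoff_limit=0,
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        containers=[client.V1Container(
                            name="pilot",
                            image="registry.example.org/atlas/pilot:latest",  # placeholder
                            command=["/bin/sh", "-c", "echo run payload here"],
                            resources=client.V1ResourceRequirements(
                                requests={"cpu": "8", "memory": "16Gi"}),
                        )],
                    )
                ),
            ),
        )

        client.BatchV1Api().create_namespaced_job(namespace="grid", body=job)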

        Speaker: Ryan Taylor (University of Victoria (CA))
      • 10:10
        Hyper-converged cloud infrastructure at CSCS 20m

        This presentation provides a detailed overview of the hyper-converged cloud infrastructure implemented at the Swiss National Supercomputing Centre (CSCS). The main objective is to provide a detailed overview of the integration between Kubernetes (RKE2) and ArgoCD, with Rancher acting as a central tool for managing and deploying RKE2 clusters infrastructure-wide.

        Rancher is used for direct deployment on MAAS-managed nodes, as well as HPC (High-Performance Computing) nodes designed for high-intensity workloads. In addition, Harvester orchestrates Kubernetes distributions for virtual clusters, improving flexibility and simplifying orchestration on the platform.

        ArgoCD plays a key role in automating deployment processes and ensuring consistency between different environments, enabling continuous delivery. The integration of Kubernetes, ArgoCD, Rancher, Harvester and Terraform forms the basis of a hyper-converged, scalable and adaptable cloud infrastructure.

        This case study provides information on the architecture, deployment workflows and operational benefits of this approach.

        Speakers: Mr Dino Conciatore (CSCS (Swiss National Supercomputing Centre)), Elia Oggian (ETH Zurich (CH))
    • 10:30 11:00
      Coffee Break 30m
    • 11:00 12:15
      Network & Security
      Convener: Shawn Mc Kee (University of Michigan (US))
      • 11:00
        CERN Prevessin Datacentre network - Overview and feedback after one year in production. 20m

        This presentation will explain the network design implemented in the CERN Prévessin Datacentre (built in 2022/2023, in production since February 2024). We will show how, starting from an empty building, current network best practices could be adopted (and partly adapted to match the specific requirements in terms of interconnection with the rest of the CERN network). We will also provide feedback about some issues encountered during planning and deployment, give an overview of the network performance after one full year in production, and share our current ideas (and questions) regarding the possible evolution of the CERN data centre network(s) in the coming years.

        Speaker: Vincent Ducret (CERN)
      • 11:20
        Cybersecurity at the Speed of HPC - Monitoring and Incident Response 20m

        High-Performance Computing (HPC) environments demand extreme speed and efficiency, making cybersecurity particularly challenging. The need to implement security controls without compromising performance presents a unique dilemma: how can we ensure robust protection while maintaining computational efficiency?
        This presentation will give an insight into real-world challenges and measures implemented at CSCS that can help the investigation in case of potential incidents. Despite advances in security tools, critical vulnerabilities remain, particularly in managing and analysing large-scale data flows in real time.

        Speaker: Mr Fabio Zambrino (CSCS)
      • 11:40
        Improving CERN's security with an Endpoint Detection and Response Solution 15m

        The deployment of an Endpoint Detection & Response (EDR) solution at CERN has been a project aimed at enhancing the security posture of endpoint devices. In this presentation we’ll share our infrastructure's architecture and how we rolled out the solution. We will also see how we addressed and overcame challenges on multiple fronts, from administrators’ fears to fine-tuning detections and balancing performance with security requirements. Finally, we will showcase the capabilities delivered by an EDR solution, including real-time threat detection, incident response, and enhanced visibility across endpoints.

        Speaker: Alexandros Petridis
      • 11:55
        Computer Security Update 20m

        This presentation aims to give an update on the global security landscape from the past year. The global political situation has introduced a novel challenge for security teams everywhere. What's more, the worrying trend of data leaks, password dumps, ransomware attacks and new security vulnerabilities does not seem to slow down.

        We present some interesting cases that CERN and the wider HEP community dealt with in the last year, mitigations to prevent possible attacks in the future and preparations for when inevitably an attacker breaks in.

        Speaker: Jose Carlos Luna Duran (CERN)
    • 12:15 13:30
      Lunch 1h 15m
    • 13:30 15:30
      Storage & data management
      Convener: Elia Oggian (ETH Zurich (CH))
      • 13:30
        How CERN’s New Datacenter Enhances Cloud Infrastructure and Data Resilience with Ceph 15m

        The storage needs of CERN’s OpenStack cloud infrastructure are fulfilled by Ceph, which provides diverse storage solutions including volumes with Ceph RBD, file sharing through CephFS, and S3 object storage via Ceph RadosGW. The integration between storage and compute resources is possible thanks to a close collaboration between the OpenStack and Ceph teams. In this talk we review the architecture of our production deployment and how it evolved with the arrival of the new datacenter in Prevessin (France) in the context of supporting BC/DR scenarios.

        Speaker: Roberto Valverde Cameselle (CERN)
      • 13:45
        CERN update on tape technology 25m

        This presentation will start with the evolution of the tape technology market in recent years and the expectations from the INSIC roadmap.

        From there, with LHC now in the middle of Run 3, we will reflect on the evolution of our capacity planning vs. increasing storage requirements of the experiments. We will then describe our current tape hardware setup and present our experience with the different components of the technology. For example, we will report on performance characteristics of both LTO9 and TS1170 tape drives: RAO, environmental aspects and how the technology evolution is impacting our operations.

        Lastly, we will share our thoughts about rack-sized scale-out tape libraries and considerations for replacing FC with SAS.

        Speaker: Vladimir Bahyl (CERN)
      • 14:10
        Evolution of Continuous Integration for the CERN Tape Archive (CTA) 20m

        The CERN Tape Archive (CTA) software is used for physics archival at CERN and other scientific institutes. CTA’s Continuous Integration (CI) system has been around since the inception of the project, but over time several limitations have become apparent. The migration from CERN CentOS 7 to Alma 9 introduced even more challenges. The CTA team took this as an opportunity to make significant improvements in the areas of simplicity, flexibility and robustness. The most impactful change was the migration from plain Kubernetes manifest files to Helm, allowing us to decouple the configuration of CTA from the EOS disk system configuration and opening up opportunities to test other disk buffer systems such as dCache. The new setup allows us to handle complex testing scenarios and perform regression testing on various components independently. We will discuss the challenges we encountered with our CI, the improvements we implemented to address them, and what we hope to do in the future.

        Speaker: Niels Alexander Buegel
      • 14:30
        The CERN Tape Archive Beyond CERN 20m

        The CERN Tape Archive (CTA) is CERN’s Free and Open Source Software system for data archival to tape. Across the Worldwide LHC Computing Grid (WLCG), the tape software landscape is quite heterogeneous, but we are entering a period of consolidation. A number of sites have reevaluated their options and have chosen CTA for their tape archival storage needs. To facilitate this, the CTA team have added a number of community features, allowing CTA to be used as the tape backend for dCache and to facilitate migrations from other tape systems such as OSM and Enstore. CTA is now packaged and distributed as a public release, free from CERN-specific dependencies, together with a set of operations tools. This contribution presents the latest CTA community features and roadmap.

        Speaker: Niels Alexander Buegel
      • 14:50
        Storage Technology Outlook 20m

        The rapid growth of data has outpaced traditional hard disk drive (HDD) scaling, leading to challenges in cost, capacity, and sustainability. This presentation examines the trends in storage technologies, highlighting the evolving role of tape technology in archive solutions. Unlike HDDs, tape continues to scale without hitting fundamental physics barriers, offering continual increases in areal density along with superior energy efficiency and cost-effectiveness. With a strong technology roadmap extending into the 2030s, tape is positioned as the best solution for archival and cold-storage needs.

        Speaker: Ed Childers (SpectraLogic)
      • 15:10
        Online Seamless HDD Self-Healing Options & Capabilities 20m

        The most common mechanical failures in today's modern HDDs in the datacenter are no longer due to motor/actuator failures or head crashes. The great majority of these failures are due to writer-head degradation with time and heat, a small minority to reader failures, and a very small number to other causes. The scope of this presentation is to present and discuss the various methods and options at our disposal to mitigate these head-failure scenarios without the drive having to be replaced or completely reformatted and the data rebuilt by the host software at the system level, which would cause a significant amount of data traffic and reduce the overall resiliency, availability and reliability of the storage solution. We currently have many options in our toolbox to address the impact of these head failures and resolve them while the drive and the majority of its data are preserved. We will discuss these various solutions and point out the pros and cons of each implementation, as some of these solutions require host management and some can be done seamlessly.

        Speakers: Curtis Stevens, Hugo Bergmann (Seagate Technology)
    • 15:30 16:00
      Coffee Break 30m
    • 16:00 17:20
      Computing and Batch Services
      Convener: Dr Michele Michelotto (Universita e INFN, Padova (IT))
      • 16:00
        HEPiX Benchmarking Working Group Report 25m

        The Benchmarking Working Group (WG) has been actively advancing the HEP Benchmark Suite to meet the evolving needs of the Worldwide LHC Computing Grid (WLCG). This presentation will provide a comprehensive status report on the WG’s activities, highlighting the intense efforts to enhance the suite’s capabilities with a focus on performance optimization and sustainability.

        In response to community feedback, the WG has developed new modules to measure server utilization metrics, including load, frequency, I/O, and power consumption, during the execution of the HEPScore benchmark. These advancements enable a more detailed evaluation of power efficiency and computational performance of servers, aligning with WLCG’s sustainability goals.

        Furthermore, updates on the integration of GPU workloads into the benchmark suite will be presented. This significant development expands the functionality of HEPScore, increases the catalogue of available workloads, and enhances the suite’s applicability to modern and diverse computing environments.

        Speaker: Domenico Giordano (CERN)
      • 16:25
        Continuous calibration and monitoring of WLCG site corepower with HEPScore23 20m

        The performance score per CPU core — corepower — reported annually by WLCG sites is a critical metric for ensuring reliable accounting, transparency, trust, and efficient resource utilization across experiment sites. It is therefore essential to compare the published CPU corepower with the actual runtime corepower observed in production environments. Traditionally, sites have reported annual performance values based on weighted averages of various CPU models, yet until now there was no direct method to validate these figures or to easily retrieve the underlying CPU model weights from each site.
        With the official adoption of HEPScore23 as a benchmark in April 2023 by the WLCG, the Benchmarking Working Group introduced new tools, including the HEP Benchmark Suite with plugins, to address this gap. The new infrastructure is able to continuously monitor and validate the reported performance values by running benchmarks across the grid. This approach ensures the accuracy of annual performance figures, promotes transparency, and enables the timely detection and correction of incorrect values with minimal effort from the sites.
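
        For illustration, the corepower figure discussed above can be thought of as HEPScore23 per physical core, averaged over a site's CPU models and weighted by core count; the numbers in the short Python sketch below are invented, not measurements from any site.

        # Worked sketch of a weighted site corepower; all numbers are made up.
        cpu_models = [
            # (model, cores in production, HEPScore23 per core)
            ("CPU model A", 4096, 15.2),
            ("CPU model B", 2048, 11.8),
            ("CPU model C", 1024, 19.5),
        ]

        total_cores = sum(cores for _, cores, _ in cpu_models)
        corepower = sum(cores * score for _, cores, score in cpu_models) / total_cores
        print(f"Weighted site corepower: {corepower:.2f} HEPScore23 per core")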

        Speaker: Natalia Diana Szczepanek (CERN)
      • 16:45
        ARC 7 - new ARC major release - and future plans 20m

        The Nordugrid Advanced Resource Connector Middleware (ARC) will manifest itself as ARC 7 this spring, after a long release preparation process. ARC 7 represents a significant advancement in the evolution of the Advanced Resource Connector Middleware, building upon elements introduced in the ARC 6 release from 2019, and refined over the subsequent years.

        This new version consolidates technological developments from the ARC 6 line, focusing on streamlining ARC Compute Element (CE) interfaces and enhancing the codebase with extensive functionality and comprehensive code cleanup. One of the notable updates in ARC 7 is the adoption of the REST interface as the primary method for job management and information retrieval within the ARC CE, as well as token-based authentication and authorization support, with X.509 proxy support still in place.

        Despite being a major release, ARC 7 maintains a strong commitment to backward compatibility with ARC 6 CEs. Except for deprecated components, existing ARC 6 server configurations are expected to integrate seamlessly with the ARC 7 release.

        In this presentation we will go through the main changes in ARC 7 compared to ARC 6. We will also present the future development plans for ARC.

        Speaker: Mattias Wadenstein (University of Umeå (SE))
      • 17:05
        MTCA starter kits from powerBridge 15m

        MTCA starter kits: the next step in their evolution

        In this presentation, you will learn more about the powerBridge starter kits. The starter kits from powerBridge include the MTCA.0 Rev. 3 changes as well as exciting new products, including payload cards, and are available in different sizes and flavours. They allow an easy jump-start for new MTCA users.

        Speaker: Thomas Holzapfel
    • 17:20 17:45
      Software and Services for Operation
      Convener: Jingyan Shi (Chinese Academy of Sciences (CN))
      • 17:20
        Windows device management at CERN: A new era 25m

        More than 10,000 Windows devices are managed by the Windows team and delegated administrators at CERN. They range from workstations on which scientists run heavy simulation software, to security-hardened desktops in the administrative sector, to Windows Servers that manage some of the most critical systems in the Organisation. Today these systems are managed using a unified MDM solution named CMF (Computer Management Framework), developed at CERN more than 20 years ago.

        As the next step into the future generation of Windows device management, a new framework to manage Windows devices at CERN has been designed and is gradually being implemented. It leverages the two MDM systems from Microsoft, Intune and Configuration Manager, which allow for co-management of both desktops and servers. The new solution expands the functionality of the former system and aligns CERN's Windows park with industry best practices, facilitating its administration and reinforcing its security posture.

        This presentation will describe CERN's path to implement both systems, including the technical challenges, such as the adaptation for interoperability with our open-source SSO, while maintaining compatibility with CERN's well-established Windows infrastructure.

        Speaker: Siavas Firoozbakht (CERN)
    • 18:00 19:30
      HEPIX Board (closed meeting) 1h 30m
    • 09:00 10:00
      Cloud Technologies, Virtualization & Orchestration, Operating Systems
      Convener: Mr Dino Conciatore (CSCS (Swiss National Supercomputing Centre))
      • 09:00
        Keeping the LHC colliding: Providing Extended Lifecycle support for EL7 20m

        The operation of the Large Hadron Collider (LHC) is critically dependent on several hundred Front-End Computers (FECs) that manage all facets of its internals. These custom systems could not be upgraded during the long shutdown (LS2), and with the coinciding end of life of EL7 on 30.06.2024, this posed a significant challenge to the successful operation of Run 3.

        This presentation will focus on how CERN IT is providing the Red Hat "Extended Lifecycle Support" (ELS) product across the CERN accelerator sector. We will discuss how this solution ensures operational continuity by maintaining software support for legacy hardware, bridging the gap between aging infrastructure and current security requirements. Technical details on how this is achieved, as well as shortcomings and lessons learned will be shared with the audience.

        Speaker: Ben Morrice (CERN)
      • 09:20
        Roadmap to LS3: CERN’s Linux Strategy 20m

        As CERN prepares for the third Long Shutdown (LS3), its evolving Linux strategy is critical to maintaining the performance and reliability of its infrastructure. This presentation will outline CERN’s roadmap for Linux leading up to LS3, highlighting the rollout of RHEL and AlmaLinux 10 to ensure stability and adaptability within the Red Hat ecosystem. In parallel, we will discuss efforts to enhance the adoption of Debian as a robust alternative, bolstering flexibility and long-term sustainability as part of a comprehensive dual-ecosystem approach.

        Speaker: Ben Morrice (CERN)
      • 09:40
        Exploring SUSE Open-Source Technology for Your Datacenter 20m

        This talk provides an overview of SUSE’s open-source solutions for modern data centers. We will discuss how SUSE technologies support various workloads while leveraging open-source flexibility and security.
        Topics include:
        - OpenSUSE Linux – A secure and open Linux system designed for high-performance workloads.
        - Harvester Project – An open-source alternative for virtualization, suitable for various use cases.
        - Rancher Project – A Kubernetes management platform for deploying and maintaining clusters.
        - NeuVector Project – A Kubernetes security platform for protecting clusters against threats.
        - Longhorn Project – A cloud-native storage solution designed for Kubernetes environments.

        Speaker: Mr Nikolaj Majorov (SUSE)
    • 10:00 10:30
      Network & Security
      Convener: Shawn Mc Kee (University of Michigan (US))
      • 10:00
        Update from the HEPiX IPv6 Working Group 30m

        The HEPiX IPv6 Working Group has been encouraging the deployment of IPv6 in WLCG and elsewhere for many years. At the last HEPiX meeting in November 2024 we reported on the status of our GGUS ticket campaign for WLCG sites to deploy dual-stack computing elements and worker nodes. Work on this has continued. We have also continued to monitor the use of IPv4 and IPv6 on the LHCOPN, with the aim to identify uses of legacy IPv4 data transfers and to remove these. A dual-stack network is not the desirable end-point for all this work; we continue to plan the move from dual-stack to IPv6-only.

        Speaker: Bruno Heinrich Hoeft (KIT - Karlsruhe Institute of Technology (DE))
    • 10:30 11:00
      Coffee Break 30m
    • 11:00 12:15
      Network & Security
      Convener: Shawn Mc Kee (University of Michigan (US))
      • 11:00
        Activities Update from the Research Networking Technical Working Group 20m

        The high-energy physics community, along with the WLCG sites and Research and Education (R&E) networks, has been collaborating on network technology development, prototyping and implementation via the Research Networking Technical Working Group (RNTWG) since early 2020. The group is focused on three main areas: network visibility, network optimization, and network control and management.

        We will describe the status of ongoing activities in the group, including topics like SciTags, traffic and host optimization, SDN testing and its involvement in WLCG data challenge activities. We will also discuss near- to long-term plans for the group and the goal to identify beneficial capabilities and get them into production before the next WLCG Network Data Challenge in early 2027.

        Speaker: Shawn Mc Kee (University of Michigan (US))
      • 11:20
        WLCG Network Monitoring Infrastructure and perfSONAR Evolution 20m

        The WLCG Network Throughput Working Group along with its collaborators in OSG, R&E networks and the perfSONAR team have collaboratively operated, managed and evolved a network measurement platform based upon the deployment of perfSONAR toolkits at WLCG sites worldwide.

        This talk will focus on the status of the joint WLCG and IRIS-HEP/OSG-LHC infrastructure, including the resiliency and robustness issues found while operating a global perfSONAR deployment. It will also discuss the work to analyze network metrics to help identify and locate problems in the network. Finally, it will outline the plans for the next evolution of the infrastructure.

        Speaker: Shawn Mc Kee (University of Michigan (US))
      • 11:40
        Single Sign-On Evolution at CERN 20m

        The Single Sign-On (SSO) service at CERN has undergone a significant evolution over recent years, transitioning from a Puppet-hosted solution to a Kubernetes-based infrastructure. Since September 2023, the current team has focused on cementing SSO as a stable and reliable cornerstone of CERN's IT services. Effort was concentrated on implementing best practices in service management - a mid-term investment that is already proving worthwhile.

        This presentation highlights the strides made in consolidating and modernizing the SSO service. Key achievements include the successful migration from Keycloak 20 to Keycloak 24 and significant improvements in monitoring using Grafana, disaster recovery preparation, and proactive alerting through Telegram and Mattermost.

        We also showcase the advantages of Keycloak as a central identity management solution for CERN. Keycloak's extensibility lies in its ability to support custom development through Java-based Service Provider Interfaces (SPIs) to meet specific organizational needs. By implementing these SPIs, the team was able to bridge the gap between modern identity protocols and CERN's diverse legacy systems.

        Furthermore, the team has implemented proactive configuration control measures, such as exporting Keycloak realm configurations to GitLab, enabling transparency and traceability for changes made to the SSO configuration.
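
        The realm-export-to-GitLab idea mentioned above can be sketched with the Keycloak admin REST API as below; the base URL, realm name and credentials are placeholders, the CERN tooling itself is not public, and committing the resulting JSON to GitLab is left to the surrounding CI job, so treat this only as an illustrative outline.

        # Simplified sketch: dump a Keycloak realm configuration to JSON for version control.
        # Base URL, realm and credentials are placeholders.
        import json

        import requests

        BASE = "https://sso.example.org"   # placeholder Keycloak base URL
        REALM = "myrealm"

        # Obtain an admin access token (password grant against the admin-cli client).
        token = requests.post(
            f"{BASE}/realms/master/protocol/openid-connect/token",
            data={"grant_type": "password", "client_id": "admin-cli",
                  "username": "admin", "password": "REPLACE_ME"},
            timeout=10,
        ).json()["access_token"]

        # Fetch the realm-level configuration and store it with stable formatting for Git diffs.
        realm_cfg = requests.get(
            f"{BASE}/admin/realms/{REALM}",
            headers={"Authorization": f"Bearer {token}"},
            timeout=10,
        ).json()

        with open(f"{REALM}-realm.json", "w") as fh:
            json.dump(realm_cfg, fh, indent=2, sort_keys=True)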

        Speaker: Paul Van Uytvinck (CERN)
      • 12:00
        Network design and implementation status of HEPS 15m

        This talk introduces the network architecture design of HEPS, including the general network, the production network, the data center network, etc.
        The running status of all the network parts will also be described.

        Speaker: Shan Zeng (曾珊)
    • 12:15 13:30
      Lunch 1h 15m
    • 13:30 15:15
      Computing and Batch Services
      Convener: Matthias Jochen Schnepf
      • 13:30
        Summary of the Autumn 2024 European HTCondor Workshop 20m

        The tenth European HTCondor workshop took place at Nikhef in Amsterdam last autumn and, as always, covered most if not all aspects of up-to-date high-throughput computing.

        Here is a short summary of the parts of general interest.

        Speaker: Christoph Beyer
      • 13:50
        Current and Future Accounting with AUDITOR 15m

        In the realm of High Throughput Computing (HTC), managing and processing large volumes of accounting data across diverse environments and use cases presents significant challenges. AUDITOR addresses this issue by providing a flexible framework for building accounting pipelines that can adapt to a wide range of needs.
        At its core, AUDITOR serves as a centralized storage solution for accounting records, facilitating data exchange through a REST interface. This enables seamless interaction with the other parts of the AUDITOR ecosystem: the collectors, which gather accounting data from various sources and push it to AUDITOR, and the plugins, which pull data from AUDITOR for subsequent processing. The modular nature of AUDITOR allows for the customization of collectors and plugins to match specific use cases and environments, ensuring a tailored approach to the management of accounting data.
        Future use cases that could be realized with AUDITOR include, for example, the accounting of GPU resources or the accounting of variable core-power values of computing nodes due to dynamic adjustment of the CPU clock frequency.

        This presentation will outline the structure of the AUDITOR accounting ecosystem, demonstrate existing accounting pipelines, and show how AUDITOR could be extended to account for environmentally sustainable computing resources.
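
        As a hedged sketch of the collector-to-AUDITOR interaction described above, the Python snippet below pushes one accounting record over HTTP; the endpoint path and record fields are assumptions for illustration, and in practice one would use the provided collectors or the pyauditor client library rather than raw requests.

        # Hedged sketch of a custom collector pushing one record to AUDITOR's REST interface.
        # Endpoint path and record fields are assumptions; see the AUDITOR docs for the real API.
        from datetime import datetime, timezone

        import requests

        AUDITOR_URL = "http://auditor.example.org:8000"   # placeholder instance

        record = {
            "record_id": "site-X-job-1234567",
            "start_time": datetime(2025, 3, 31, 8, 0, tzinfo=timezone.utc).isoformat(),
            "stop_time": datetime(2025, 3, 31, 12, 0, tzinfo=timezone.utc).isoformat(),
            "components": [
                {"name": "Cores", "amount": 8,
                 "scores": [{"name": "HEPSPEC06", "value": 10.0}]},
                {"name": "Memory", "amount": 16000, "scores": []},
            ],
            "meta": {"site_id": ["site-X"], "user_id": ["user001"]},
        }

        # Assumed endpoint name; a plugin would later pull such records for processing.
        requests.post(f"{AUDITOR_URL}/record", json=record, timeout=10).raise_for_status()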

        Speaker: Dirk Sammel (University of Freiburg (DE))
      • 14:05
        Provisioning and Usage of GPUs at GridKa 15m

        In recent years, GPUs have become increasingly interesting for particle physics. GridKa therefore provides GPU machines to the Grid and to the particle physics institute at KIT.
        Since GPU usage and provisioning differ from those of CPUs, some development on both the provider and the user side is necessary.
        The provided GPUs allow the HEP community to use GPUs in the Grid environment and to develop solutions for efficiently using these powerful and expensive devices.
        We present our experiences with the GPU machines from the site perspective, including job scheduling, the technologies studied, and their usage by the Grid and the local particle physics institute.

        Speaker: Matthias Jochen Schnepf
      • 14:20
        User and "queue" caps in HTCondor 15m

        At Nikhef, we've based much of our "fairness" policy implementation around user, group, and job-class (i.e. queue) "caps", that is, setting upper limits on the number of simultaneous jobs (or used cores). One of the main use cases for such caps is to prevent one or two users from acquiring the whole cluster for days at a time, blocking all other usage.

        When we started using HTCondor, there was no obvious way to implement these caps. We recently discovered a native HTCondor functionality that allowed us to implement them with minimal extra configuration. This talk will explain how.
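
        The talk does not name the HTCondor feature used at Nikhef; concurrency limits are one plausible native mechanism for such caps, sketched below with the HTCondor Python bindings. The limit name, value and submit description are hypothetical.

        # One plausible way to cap simultaneous jobs: HTCondor concurrency limits.
        # Pool-side configuration (e.g. /etc/condor/config.d/limits.conf), hypothetical value:
        #     BIGUSER_LIMIT = 500    # at most 500 jobs tagged "biguser" run at once
        import htcondor

        sub = htcondor.Submit({
            "executable": "analysis.sh",
            "request_cpus": "1",
            "concurrency_limits": "biguser",   # each job consumes one unit of BIGUSER_LIMIT
        })

        schedd = htcondor.Schedd()
        result = schedd.submit(sub, count=10)  # jobs beyond the cap simply stay idle
        print("Submitted cluster", result.cluster())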

        Speaker: Jeff Templon
      • 14:35
        Efficiency of job processing in many-core Grid and HPC environments 20m

        Developments in microprocessor technology have confirmed the trend towards higher core counts and a decreased amount of memory per core, resulting in major improvements in power efficiency for a given level of performance. Per-node core counts have increased significantly over the past five years for the x86_64 architecture, which dominates the LHC computing environment, and the higher core density is not only a feature of large HPC systems but is also readily available on the commodity hardware preferentially used at Grid sites. The baseline multi-core workloads are, however, still largely based on 8 cores, and the LHC experiments employ different strategies for scheduling their payloads at sites. In this work we investigate possible implications of scaling up the core counts of grid jobs, up to whole-node jobs where possible.

        Speaker: Gianfranco Sciacca (Universitaet Bern (CH))
      • 14:55
        Smart HPC-QC: flexible approaches for Quantum workloads integration 20m

        Many efforts have tried to combine the HPC and QC fields, proposing integrations between quantum computers and traditional clusters. Despite these efforts, the problem is far from solved, as quantum computers face a continuous evolution. Moreover, nowadays, quantum computers are scarce compared to the traditional resources in the HPC clusters: managing the access from the HPC nodes is non-trivial, as it is easy to turn the accelerator into a bottleneck. Through the SmartHPC-QC project, we design solutions to this integration issue, defining interactions based on the application pattern and depending on the underlying technology of the quantum computer. The project aims to define an integration plan that can satisfy the users' different needs without burdening them with excessive technical complexity. To achieve this goal, we use various approaches, from more typical ones (workflow-based) to more niche solutions (like virtualisation and malleability).

        Speaker: Mr Simone Rizzo (E4 COMPUTER ENGINEERING Spa)
    • 15:15 15:30
      Miscellaneous
      Convener: Jose Flix Molina (CIEMAT - Centro de Investigaciones Energéticas Medioambientales y Tec. (ES))
      • 15:15
        Preserving IT History for more than a decade at CC-IN2P3 15m

        Since its launch in 2011, CC-IN2P3's computer history museum has been visited by 13,000 people. It is home to more than 1,000 artefacts, among which are France's first web server and a mysterious French micro-computer called the CHADAC.
        We will demonstrate through our experience and several examples that the physical and digital preservation of IT infrastructure components, while being a paramount task in its own right, also serves education and science at large.
        We will describe the scope of our activity and discuss the challenges of running a computer museum in a French Tier-1. We shall underline the national and international context, and hopefully trigger interest from our community: what about HEPiX's heritage?

        Speaker: Dr Fabien WERNLI
    • 15:30 16:00
      Coffee Break 30m
    • 16:00 16:45
      Miscellaneous
      Convener: Jose Flix Molina (CIEMAT - Centro de Investigaciones Energéticas Medioambientales y Tec. (ES))
      • 16:00
        High-performance end-user analysis code; an example 20m

        With an increasing focus on green computing, and with the high-luminosity LHC fast approaching, we need every bit of extra throughput that we can get. In this talk, I'll explore my old ATLAS analysis code as an example of how improvements to end-user code can significantly improve performance. Not only does this result in more efficient utilisation of the available resources, it also reduces processing time, leading to a better end-user experience.
        I will show the modular set-up of the program, the freely extendable messaging system between the modules, the run-time configuration through a text file, and its zero-copy variable implementation. (A generic illustration of the zero-copy idea follows this entry.)

        Speaker: Dr Daniël Geerts (Nikhef)
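
        The analysis code itself is not reproduced here; purely as a generic, language-agnostic illustration of what "zero-copy" variable access means (not the implementation discussed in the talk), a numpy view shares the parent array's buffer instead of duplicating the data:

            # A numpy slice is a view onto the same memory buffer, so selecting
            # a subset of a variable does not copy the underlying data.

            import numpy as np

            values = np.arange(1_000_000, dtype=np.float64)   # stand-in for an event-level variable

            view = values[::2]                 # every second entry: a view, not a copy
            print(view.base is values)         # -> True: the buffer is shared

            view[0] = -1.0                     # writes through to the original array
            print(values[0])                   # -> -1.0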
      • 16:20
        CC-IN2P3 user documentation 15m

        CC-IN2P3 provides storage and computing resources to around 2,700 users. Other services reach an even larger community, such as GitLab and its 10,000 users. It is therefore vital for CC-IN2P3 to provide accurate user documentation.

        In this presentation, we'll share feedback from five years of managing the CC-IN2P3 user documentation. We will begin by outlining the reasons behind the technology choice supporting the current online publication. Then we will discuss the procedures for managing content updates. Finally, we'll explain the software stack and the automation put in place to deploy this essential user-support service and keep it up to date.

        Speaker: Gino MARCHETTI
      • 16:35
        RAL use case of XRootD Managers 10m

        RAL uses the XRootD Cluster Management System to manage our
        XRootD server frontends for disk-based storage (ECHO).
        In this session, I'll give an overview of our configuration, the custom scripts we use, and observations on how it behaves in different setups.

        Speaker: Thomas Jyothish (STFC)
    • 17:30 19:00
      Visit to CSCS + group photo 1h 30m CSCS

      CSCS

    • 19:45 22:00
      Social dinner 2h 15m Braceria Helvetica

      Braceria Helvetica

      Quartiere Maghetti, 6900 Lugano
    • 09:00 10:35
      Software and Services for Operation
      Convener: Dennis van Dok (Nikhef)
      • 09:00
        JUNO Distributed Computing Infrastructure and Services Monitoring System 20m

        JUNO is an international collaborative neutrino experiment located in Kaiping City, southern China. The JUNO experiment employs a WLCG-based distributed computing system for official data production. The JUNO distributed computing sites are located in China, Italy, France, and Russia. To monitor the operational status of the distributed computing sites and other distributed computing services, as well as to account for the cumulative resource consumption at these sites, we have developed a distributed computing infrastructure monitoring system. This service uses a dedicated workflow management tool to execute site SAM tests. Currently, the monitoring system provides data collection and visualization services across several key areas, including site operational status, data transfer status, traditional data statistics, and service operational status.

        Speaker: Xuantong Zhang (Institute of High Energy Physics, Chinese Academy of Sciences (CN))
      • 09:20
        Integrated Configuration Management at Karlsruhe Institute of Technology (KIT) 15m

        At KIT we operate more than 800 hosts to run the Large Scale Data Facility (LSDF) and the WLCG Tier-1 center GridKa. Our configuration management efforts aim for reliable, consistent and reproducible host deployment, which allows for unattended mass deployment of stateless machines such as the GridKa compute farm. In addition, our approach supports efficient patch management to tackle security challenges and rapid recovery of the entire infrastructure after a cyber attack. Further, we relieve our system administrators of low-level tasks and enable them to focus on conceptual work and user support. This leads to highly specialized staff who focus on dedicated services instead of entire hosts.

        In our presentation we will talk about the architecture of our configuration management system and the technologies we use. This includes Puppet, Hiera, Foreman, GitLab and various interfaces to external services such as DNS or certificate authorities.

        Speaker: Nico Schlitter (Karlsruhe Institute of Technology (DE))
      • 09:35
        From Batch to Interactive: The "INK" for High Energy Physics Data Analysis at IHEP 20m

        The IHEP computing platform faces new requirements in data analysis, including limited access to login nodes, increasing demand for code-debugging tools, and efficient data access for collaborative workflows. We have developed an Interactive aNalysis workbench (INK), a web-based platform leveraging the HTCondor cluster. This platform transforms traditional batch-processing resources into a user-friendly, web-accessible interface, enabling researchers to utilize cluster computing and storage resources directly through their browsers. A loosely coupled architecture with token-based access ensures platform security, while fine-grained permission management allows customizable access for users and experimental applications. Universal public interfaces abstract the heterogeneity of the underlying resources, ensuring environment consistency and seamless integration with interactive analysis tools. Initial feedback from a pilot group of users has been highly positive. The platform is now in its final testing phase and will soon be officially deployed for all users. (A generic sketch of token-based access checks follows this entry.)

        Speaker: Dr Jingyan Shi (IHEP)
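
        INK's actual token format and permission model are not described in the abstract; as a generic, hedged illustration of token-based access with scoped permissions, the sketch below uses signed JSON Web Tokens via the PyJWT package (the secret, claims and scope names are placeholders):

            # Issue and verify a signed token carrying per-user scopes.
            import jwt  # PyJWT

            SECRET = "replace-with-a-real-key"   # placeholder shared secret

            token = jwt.encode({"sub": "alice", "scope": "submit:htcondor read:storage"},
                               SECRET, algorithm="HS256")

            claims = jwt.decode(token, SECRET, algorithms=["HS256"])
            if "submit:htcondor" in claims["scope"].split():
                print(f"user {claims['sub']} may submit jobs")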
      • 09:55
        Grafana dashboards as code with Jsonnet 20m

        Grafana dashboards are easy to make but hard to maintain. Since changes can be made easily, the questions that remain are how to avoid changes that overwrite other work, how to keep track of changes, and how to communicate them to users. Another question that pops up frequently is how to apply certain changes consistently across multiple visualizations and dashboards. One partial solution is to export Grafana dashboards to their JSON representation and store those in a git repository. However, even simple dashboards can quickly run into thousands of lines of JSON, and version-controlling these is problematic in its own right: the diffs are large and changes do not easily carry over to the JSON representations of other dashboards. Instead, we propose to use the Jsonnet configuration language, together with its library Grafonnet, to create so-called 'dashboard definitions' that compile into JSON representations of dashboards. We have adopted this solution to manage multiple dashboards with around 20 visualizations each, leading to a clear improvement in their maintainability and deployment. In this contribution, we will show how we created complex dashboards with multiple types of visualizations and data sources using functional code and Jsonnet. Additionally, we will show how dashboards-as-code can be integrated into git repositories and their CI/CD pipelines to ensure consistency between the dashboard definitions and the dashboards in a Grafana instance. (A hedged sketch of such a CI deployment step follows this entry.)

        Speaker: Ewoud Ketele (CERN)
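
        The CI/CD integration described above boils down to compiling the Jsonnet definitions and pushing the resulting JSON to Grafana. A hedged sketch of such a deployment step, assuming the jsonnet CLI is installed, Grafonnet is vendored under ./vendor, and GRAFANA_URL/GRAFANA_TOKEN point at the target instance (all names and paths are placeholders):

            # Compile a Jsonnet dashboard definition and push it via Grafana's HTTP API.
            import json
            import os
            import subprocess

            import requests

            GRAFANA_URL = os.environ.get("GRAFANA_URL", "https://grafana.example.org")
            TOKEN = os.environ["GRAFANA_TOKEN"]          # service-account token

            dashboard_json = subprocess.run(
                ["jsonnet", "-J", "vendor", "dashboards/overview.jsonnet"],
                check=True, capture_output=True, text=True,
            ).stdout

            resp = requests.post(
                f"{GRAFANA_URL}/api/dashboards/db",
                headers={"Authorization": f"Bearer {TOKEN}"},
                json={"dashboard": json.loads(dashboard_json), "overwrite": True},
                timeout=30,
            )
            resp.raise_for_status()
            print("deployed:", resp.json().get("url"))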
      • 10:15
        MarmotGraph @ CSCS - A knowledge graph for linked HPC data 20m

        A High-Performance Computing (HPC) center typically consists of various domains, from the physical world (hardware, power supplies, etc.) up to highly abstracted, virtualized and dynamic execution environments (cloud infrastructures, software and service dependencies, central services, etc.). The tools used to manage those different domains are as heterogeneous as the domains themselves. Accordingly, information on how the different layers are designed, set up, and interconnected is spread across various systems, databases, and persons within the organization. Keeping the information consistent across the domain-specific tools and gaining an overarching representation of the center is a huge challenge.

        At CSCS, we're trying to approach this issue by introducing a central knowledge graph. Within the European research project "EBRAINS 2.0" (the successor of the "Human Brain Project"), we're already developing and operating a knowledge graph solution. Whilst this solution has proven its capabilities in a neuroscientific context for more than 6 years, we're now extending it for this new use case by integrating multi-tenancy capabilities and preparing it to become a generally applicable product under the name of “MarmotGraph”. By extracting information from various existing tools in the HPC center and mapping it onto a common linked metadata model, we can not only make information more accessible to the whole organization but also detect inconsistencies or delays in eventually consistent data states.

        In this session, we will present the current state of the development of our solution as well as the designed model, and discuss the challenges, opportunities, and risks involved in implementing a centralized knowledge-management system such as the MarmotGraph. (A generic illustration of a linked metadata model follows this entry.)

        Speaker: Oliver Schmid
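
        MarmotGraph's own data model and API are not documented here; purely as a generic, hedged illustration of what a linked metadata model makes possible (using the rdflib package and made-up entity names), a small graph linking hardware, power and services can answer cross-domain questions:

            # Link a compute node to its rack, the rack to a PDU, and a service to the
            # node, then ask which services are affected if that PDU fails.
            from rdflib import Graph, Namespace, RDF

            EX = Namespace("https://example.org/hpc#")
            g = Graph()
            g.add((EX.node0042, RDF.type, EX.ComputeNode))
            g.add((EX.node0042, EX.locatedIn, EX.rack_A17))
            g.add((EX.rack_A17, EX.fedBy, EX.pdu_7))
            g.add((EX.slurm_ctl, EX.dependsOn, EX.node0042))

            query = """
            PREFIX ex: <https://example.org/hpc#>
            SELECT ?service WHERE {
              ?rack ex:fedBy ex:pdu_7 .
              ?node ex:locatedIn ?rack .
              ?service ex:dependsOn ?node .
            }
            """
            for row in g.query(query):
                print(row.service)        # -> https://example.org/hpc#slurm_ctl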
    • 10:35 11:00
      Coffee Break 25m
    • 11:00 12:00
      Miscellaneous
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 11:00
        Managing Microsoft SQL Infrastructure at CERN 15m

        For its operations, CERN depends on an extensive range of applications, which is achievable only through the use of diverse technologies, including more than one relational database management system (RDBMS). This presentation provides an overview of CERN’s Microsoft SQL Server (MSSQL) infrastructure, highlighting how we manage servers and design solutions for large-scale databases with different compatibility levels and criticality. We will explore the advantages and challenges of using MSSQL and how we provide high availability for critical databases, tailored solutions for applications, and separate environments for production and testing. Furthermore, we will discuss our backup infrastructure and the data integrity tests that ensure disaster recovery.

        Speaker: Ricardo Martins Goncalves
      • 11:15
        PIC's Big Data Analysis Facility 15m

        The Port d'Informació Científica (PIC) provides advanced data analysis services to a diverse range of scientific communities.

        This talk will detail the status and evolution of PIC's Big Data Analysis Facility, centered around its Hadoop platform. We will describe the architecture of the Hadoop cluster and the services running on top of it, including CosmoHub, a web application that exemplifies the FAIR principles for interactive exploration of large astronomical catalogs, and Scipic, a pipeline to generate mock galaxy catalogs.

        We will explain how the Hadoop platform integrates with other PIC services, such as HTCondor for backfilling and JupyterHub for interactive computing and data analysis.

        Furthermore, the talk will cover PIC's roadmap for these services, including plans to federate with external identity providers (e.g., eduGAIN), to enhance CosmoHub with an IVOA TAP endpoint for improved data accessibility and interoperability, and ongoing efforts to support multi-messenger astronomy through CosmoHub extensions. (An illustrative sketch of TAP access follows this entry.)

        Speaker: Francesc Torradeflot
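
        The TAP endpoint mentioned in the roadmap is not yet available; as an illustration of the kind of programmatic access it would enable (the service URL and table name below are hypothetical placeholders), a query with the pyvo package could look like this:

            # Query a (hypothetical) IVOA TAP service with ADQL via pyvo.
            import pyvo

            service = pyvo.dal.TAPService("https://cosmohub.pic.es/tap")   # placeholder URL
            result = service.search(
                "SELECT TOP 10 ra, dec, z FROM catalog.mock_galaxies ORDER BY z DESC"
            )
            print(result.to_table())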
      • 11:30
        Machine learning for developers and administrators 15m

        Developing and managing computing systems is complex due to rapidly changing technology, evolving requirements during development, and ongoing maintenance throughout their lifespan. Significant post-deployment maintenance includes troubleshooting, patching, updating, and modifying components to meet new feature or security needs. Investigating unusual events may involve reviewing system descriptions, administrator archives, administrative orders, official recommendations, and system logs. The primary goal is to keep the investigation time within reasonable limits. Retrieval-Augmented Generation (RAG), a machine learning technique that has been advancing steadily since around 2021, can be regarded as a form of knowledge transfer. In the case studied, large computing systems are the application point of RAG, which includes a large language model (LLM) acting as a collaborator for the development team. This approach offers advantages both during the development of computing systems and in the operations phase. (A minimal sketch of the retrieval step follows this entry.)

        Speaker: Andrey Shevel (Petersburg Nuclear Physics Institute, University of Information Technology, Mechanics and Optics)
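
        RAG combines a retrieval step over local documents with an LLM that answers using the retrieved context. A minimal, hedged sketch of the retrieval step over admin notes (TF-IDF from scikit-learn stands in for a proper embedding store, and ask_llm() is a placeholder for whatever model a site chooses):

            # Retrieve the most relevant admin document for a question, then build
            # the prompt that would be handed to an LLM.
            from sklearn.feature_extraction.text import TfidfVectorizer
            from sklearn.metrics.pairwise import cosine_similarity

            documents = [
                "2024-11-03: increased GPFS pagepool after OOM events on worker nodes",
                "Procedure: rotating host certificates before expiry",
                "Postmortem: batch scheduler outage caused by a full /var partition",
            ]
            question = "Why did the batch scheduler stop accepting jobs?"

            vectorizer = TfidfVectorizer()
            doc_vectors = vectorizer.fit_transform(documents)
            scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]

            context = documents[scores.argmax()]          # best-matching document
            prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
            # answer = ask_llm(prompt)                    # placeholder: site-specific LLM call
            print(prompt)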
      • 11:45
        Enabling Accessibility to CERN audiovisual content via Automated Speech Recognition 15m

        A key stepping stone in promoting diversity and accessibility at CERN is providing users with subtitles for all CERN-produced multimedia content. Subtitles not only enhance accessibility for individuals with impairments and non-native speakers but also make what would otherwise be opaque content fully searchable. The “Transcription and Translation as a Service” (TTaaS) project [1] addresses this need by offering a high-performance, privacy-preserving, and cost-efficient Automated Speech Recognition (ASR) and translation system for both existing and newly created audiovisual materials, including videos and webcasts.

        The TTaaS solution is powered by state-of-the-art technology developed by the MLLP group [2] at the Universitat Politècnica de València. Over the past two years, the service has processed more than 30,000 hours of CERN media, delivering accurate transcripts and translations to ensure accessibility for a global audience. The system has also been tested for live ASR during several CERN/HEP conferences and events.

        This presentation will provide an in-depth look at the TTaaS solution, including its core technologies, operational workflows, integration with CERN IT services, and its significant role in making CERN’s multimedia content accessible to all. (A generic subtitle-generation sketch, not the TTaaS stack, follows this entry.)

        [1] https://ttaas.docs.cern.ch/
        [2] https://www.mllp.upv.es/

        Speaker: Ruben Domingo Gaspar Aparicio (CERN)
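
        The MLLP-based TTaaS pipeline itself is not publicly scripted here; purely as a generic illustration of automated subtitle generation with an open-source model instead (the openai-whisper package; model choice and file names are placeholders), one could write:

            # Transcribe a recording and write SRT-style subtitles from the segments.
            import whisper

            def to_timestamp(seconds: float) -> str:
                ms = int(seconds * 1000)
                h, ms = divmod(ms, 3_600_000)
                m, ms = divmod(ms, 60_000)
                s, ms = divmod(ms, 1_000)
                return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

            model = whisper.load_model("base")
            result = model.transcribe("lecture.mp4")

            with open("lecture.srt", "w", encoding="utf-8") as srt:
                for i, seg in enumerate(result["segments"], start=1):
                    srt.write(f"{i}\n{to_timestamp(seg['start'])} --> {to_timestamp(seg['end'])}\n")
                    srt.write(seg["text"].strip() + "\n\n")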
    • 12:00 12:30
      Wrap-up
      Convener: Jose Flix Molina (CIEMAT - Centro de Investigaciones Energéticas Medioambientales y Tec. (ES))
      • 12:00
        Wrap-up 30m
        Speaker: Jose Flix Molina (CIEMAT - Centro de Investigaciones Energéticas Medioambientales y Tec. (ES))
    • 12:30 13:45
      Lunch 1h 15m