KEK is promoting various accelerator science projects by fully utilizing the electron accelerator in Tsukuba and the proton accelerator in Tokai.
These projects require a large amount of data processing, and our central computing system, KEKCC, plays a key role in their success. KEKCC also operates as part of the Grid system, which is essential to the Belle II project.
We...
Evolution of the NCG-INGRID-PT site and future perspectives.
New developments in the distributed Nordic Tier-1 and its participating sites.
An update on recent developments at the Scientific Computing and Data Facilities (SCDF) at BNL.
An update on recent advancements at the US ATLAS SouthWest Tier2 Center (UTA/OU).
This report introduces the LHCb Tier-2 site at Lanzhou University (LZU-T2), which is a major new computing resource designed to support the LHCb experiment. It is part of the Worldwide LHC Computing Grid, which distributes data processing and storage across a network of international computing centers. The LZU-T2 site plays a critical role in processing, analyzing, and storing the vast amounts...
News from CERN since the last HEPiX workshop. This talk gives a general update from services in the CERN IT department.
EOS is an open-source storage system developed at CERN that is used as the main platform to store LHC data. The architecture of the EOS system has evolved over the years to accommodate ever more diverse use-cases and performance requirements coming both from the LHC experiments as well as from the user community running their analysis workflows on top of EOS. In this presentation, we discuss...
We report on our experience with the production backup orchestration via “cback”, a tool developed at CERN and used to back up our primary mounted filesystem offerings: EOS (eosxd) and Ceph (CephFS). In a storage system that handles non-reproducible data, a robust backup and restore system is essential for effective disaster recovery and business continuity. When designing a backup solution,...
Even though CERN IT recently commissioned the Prévessin Data Centre (PDC), doubling the organization’s hosting capacity in terms of electricity and cooling, the 50-year-old Meyrin Data Centre (MDC) remains indispensable due to its strategic geographical location and unique electrical power resilience. The Meyrin Data Centre (Building 513) retains an essential role for the CERN Tier-0 Run 4...
On the 30th of June 2024, the end of CentOS 7 support marked a new era for the operation of the multi-petabyte distributed disk storage system used by CERN physics experiments. The EOS infrastructure at CERN comprises approximately 1000 disk servers and 50 metadata management nodes. Their transition from CentOS 7 to Alma 9 was not as straightforward as anticipated.
This presentation...
Traditional filesystems organize data into directories based on a single criterion, such as the starting date of the experiment, experiment name, beamline ID, measurement device, or instrument. However, each file within a directory can belong to multiple logical groups, such as a special event type, experiment condition, or part of a selected dataset. dCache, a storage system designed to...
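The idea of one file belonging to several logical groups at once can be sketched with a simple tag index (an illustration of the concept only, not dCache's actual interface; the paths and tag names are made up):

```python
from collections import defaultdict

class TagIndex:
    """Multiple logical 'views' over the same files, unlike a single
    directory tree where each file lives in exactly one place."""
    def __init__(self):
        self._by_tag = defaultdict(set)

    def add(self, path, *tags):
        for tag in tags:
            self._by_tag[tag].add(path)

    def view(self, tag):
        return sorted(self._by_tag[tag])

idx = TagIndex()
idx.add("/data/run1/f001.h5", "beamline-A", "calibration")
idx.add("/data/run1/f002.h5", "beamline-A", "selected-dataset")
print(idx.view("beamline-A"))   # both files appear in this view
print(idx.view("calibration"))  # only f001 appears here
```

A storage system exposing such views lets each community browse by the criterion that matters to it (beamline, event type, dataset membership) without duplicating the data.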
The NVMe HDD Specification was released back in 2022, but only very early Engineering Demo Units have been created so far, from a single source. That said, the market demand is definitely growing, and the industry must pay attention to the potential TCO and storage stack optimizations that a unified NVMe storage interface could offer. In this session, we will go over the TCO analysis details...
In this presentation we try to give an update on CPU, GPU and AI accelerators in the market today.
The objective of this talk is to share the tentative plan for an energy-efficiency status review by the TechWatch WG. Progress on the primary tasks will be shared: reviewing and understanding industry, market, and technology trends; efforts of WLCG and sites; and the strategy spanning measurement, data collection, analysis, modeling, and estimation. Through this report, the...
Nikhef has recently renovated its building and upgraded almost everything to the latest standards, including the audio/video setup in the new meeting rooms.
This talk will give an insight into the process, from choosing technologies and tendering to installation, testing, and getting everything working. What went wrong and what did not. How you would think that 4K 60Hz is easy these days. Why...
The Cherenkov Telescope Array Observatory (CTAO) is a next-generation ground-based gamma-ray astronomical observatory under construction on two sites, one in the Northern and one in the Southern hemisphere. CTAO telescopes use the atmosphere as a giant detector of high-energy particles. CTAO data contain "events" of extensive air showers of high-energy particles. Most of the showers are induced by charged...
The progress and status of the IHEP site since the last HEPiX workshop.
The Cherenkov Telescope Array Observatory (CTAO) is the next-generation gamma-ray telescope facility, currently under construction.
The CTAO recently reached a set of crucial milestones: it has been established as a European Research Infrastructure Consortium (ERIC), all four Large-Sized Telescopes at the northern site of the Observatory reached key construction milestones, and the first...
The infrastructure monitoring helps to control and monitor in real time the servers and applications involved in the operation of the WLCG Tier-1 center GridKa, including the online and tape storage, the batch system, and the GridKa network.
Monitoring data such as server metrics (CPU, memory, disk, network), storage operations (I/O statistics), or visualizing real-time sensor data such as...
The Swiss National Supercomputing Centre (CSCS) is committed to sustainable high-performance computing. This talk will explore how CSCS leverages lake water for efficient cooling, significantly reducing energy consumption. Additionally, we will discuss the reuse of waste heat to support local infrastructure, demonstrating a practical and efficient approach to sustainability in supercomputing.
This study presents analyses of natural job drainage and power reduction patterns in the PIC Tier-1 data center, which uses HTCondor for workload scheduling. By examining historical HTCondor logs from 2023 and 2024, we simulate natural job drainage: the pattern in which jobs conclude without external intervention. These findings provide...
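As an illustration of what such a simulation involves (a minimal sketch with toy data, not PIC's actual analysis), the drainage curve can be derived from job start/end intervals: freeze new job starts at a cutoff time and count how many already-running jobs remain at each step afterwards.

```python
from datetime import datetime, timedelta

def drainage_profile(jobs, cutoff, step=timedelta(hours=1), horizon=24):
    """Given (start, end) job intervals, count how many jobs that were
    running at `cutoff` are still running at each later step -- the
    natural drainage curve if no new jobs are started."""
    running = [(s, e) for s, e in jobs if s <= cutoff < e]
    return [sum(1 for s, e in running if e > cutoff + h * step)
            for h in range(horizon + 1)]

# Toy example: three jobs with different lifetimes
t0 = datetime(2024, 1, 1)
jobs = [
    (t0 - timedelta(hours=2), t0 + timedelta(hours=1)),
    (t0 - timedelta(hours=5), t0 + timedelta(hours=3)),
    (t0 + timedelta(hours=1), t0 + timedelta(hours=4)),  # starts after cutoff, ignored
]
print(drainage_profile(jobs, t0, horizon=4))  # [2, 1, 1, 0, 0]
```

In a real study the intervals would come from parsed HTCondor history records, and the occupancy counts would be weighted by cores or power draw per slot.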
The CERN Cloud Infrastructure Service provides access to large compute and storage resources for the laboratory, including virtual and physical machines, volumes, file shares, load balancers, etc., across two different data centres. With the recent addition of the Prévessin Data Centre, one of the main objectives of the CERN IT Department is to ensure that all services have up-to-date procedures...
Extending the data presented at the last few HEPiX workshops, we present new measurements on the energy efficiency (HEPScore/Watt) of the recently available AmpereOne-ARM and AMD Turin-x86 machines.
The Technology Watch Working Group, established in 2018 to take a close look at the evolution of the technology relevant to HEP computing, has resumed its activities after a long pause. In this report, we provide an overview of the hardware technology landscape and some recent developments, highlighting the impact on the HEP computing community.
DESY operates the IDAF (Interdisciplinary Data and Analysis Facility) for all science branches: high energy physics, photon science, and accelerator R&D and operations.
The NAF (National Analysis Facility) is an integrated part and has acted as an analysis facility for the German ATLAS and CMS communities as well as the global Belle II community since 2007.
This presentation will show the current...
Transition of German University Tier-2 Resources to HPC Compute and Helmholtz Storage
The March 2022 perspective paper of the German Committee for Elementary Particle Physics proposes a transformation of the provision of computing resources in Germany. In preparation for the HL-LHC, the German university Tier 2 centres are to undergo a transition towards a more resource-efficient and...
Radio astronomers are engaged in an ambitious new project to detect faster, fainter, and more distant astrophysical phenomena using thousands of individual radio receivers linked through interferometry. The expected deluge of data (up to 300 PB per year) poses a significant computational challenge that requires rethinking and redesigning the state-of-the-art data analysis pipelines.
The SKA Observatory is expected to produce up to 600 petabytes of scientific data per year, which would set a new record for data generation within the field of observational astronomy. The SRCNet infrastructure is designed to handle these large volumes of astronomy data, which requires a global network of distributed regional centres for the data- and compute-intensive astronomy use...
The University of Victoria operates a scientific OpenStack cloud for Canadian researchers, and the CA-VICTORIA-WESTGRID-T2 grid site for the ATLAS experiment at CERN. We are shifting both of these service offerings towards a Kubernetes-based approach. We have exploited the batch capabilities of Kubernetes to run grid computing jobs and replace the conventional grid computing elements by...
This presentation provides a detailed overview of the hyper-converged cloud infrastructure implemented at the Swiss National Supercomputing Centre (CSCS). The main objective is to provide a detailed overview of the integration between Kubernetes (RKE2) and ArgoCD, with Rancher acting as a central tool for managing and deploying RKE2 clusters infrastructure-wide.
Rancher is used for direct...
This presentation will explain the network design implemented in the CERN Prévessin Datacentre (built in 2022/2023, in production since February 2024). We will show how, starting from an empty building, the current network best practices could be adopted (and partly adapted to match the specific requirements in terms of interconnection with the rest of the CERN network). We will also provide...
High-Performance Computing (HPC) environments demand extreme speed and efficiency, making cybersecurity particularly challenging. The need to implement security controls without compromising performance presents a unique dilemma: how can we ensure robust protection while maintaining computational efficiency?
This presentation will give an insight into real-world challenges and measures...
The deployment of an Endpoint Detection & Response (EDR) solution at CERN has been a project aimed at enhancing the security posture of endpoint devices. In this presentation we’ll share our infrastructure's architecture and how we rolled out the solution. We will also see how we addressed and overcame challenges on multiple fronts, from administrators' fears to fine-tuning detections and...
This presentation aims to give an update on the global security landscape from the past year. The global political situation has introduced a novel challenge for security teams everywhere. What's more, the worrying trend of data leaks, password dumps, ransomware attacks and new security vulnerabilities shows no sign of slowing down.
We present some interesting cases that CERN and the wider HEP...
The storage needs of CERN’s OpenStack cloud infrastructure are fulfilled by Ceph, which provides diverse storage solutions including volumes with Ceph RBD, file sharing through CephFS, and S3 object storage via Ceph RadosGW. The integration between storage and compute resources is possible thanks to a close collaboration between the OpenStack and Ceph teams. In this talk we review the architecture...
This presentation will start with the evolution of the tape technology market in recent years and the expectations from the INSIC roadmap.
From there, with LHC now in the middle of Run 3, we will reflect on the evolution of our capacity planning vs. increasing storage requirements of the experiments. We will then describe our current tape hardware setup and present our experience with...
The CERN Tape Archive (CTA) software is used for physics archival at CERN and other scientific institutes. CTA’s Continuous Integration (CI) system has been around since the inception of the project, but over time several limitations have become apparent. The migration from CERN CentOS 7 to Alma 9 introduced even more challenges. The CTA team took this as an opportunity to make significant...
The CERN Tape Archive (CTA) is CERN’s Free and Open Source Software system for data archival to tape. Across the Worldwide LHC Computing Grid (WLCG), the tape software landscape is quite heterogeneous, but we are entering a period of consolidation. A number of sites have reevaluated their options and have chosen CTA for their tape archival storage needs. To facilitate this, the CTA team have...
Storage Technology Outlook
The rapid growth of data has outpaced traditional hard disk drive (HDD) scaling, leading to challenges in cost, capacity, and sustainability. This presentation examines the trends in storage technologies highlighting the evolving role of tape technology in archive solutions. Unlike HDDs, tape continues to scale without hitting fundamental physics barriers, offering...
The most common mechanical failures in today's modern HDDs in the datacenter are no longer due to motor/actuator failures or head crashes. The great majority of these failures are due to writer head degradation with time and heat, a small minority to reader failures, and a very small number of failures are due to other causes. The scope of this presentation is to present and discuss the...
The Benchmarking Working Group (WG) has been actively advancing the HEP Benchmark Suite to meet the evolving needs of the Worldwide LHC Computing Grid (WLCG). This presentation will provide a comprehensive status report on the WG’s activities, highlighting the intense efforts to enhance the suite’s capabilities with a focus on performance optimization and sustainability.
In response to...
The performance score per CPU core — corepower — reported annually by WLCG sites is a critical metric for ensuring reliable accounting, transparency, trust, and efficient resource utilization across experiment sites. It is therefore essential to compare the published CPU corepower with the actual runtime corepower observed in production environments. Traditionally, sites have reported annual...
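For a heterogeneous farm, the published site corepower is typically a capacity-weighted average of HEPScore per core over the node types. A minimal sketch (the farm composition and scores below are illustrative, not real WLCG figures):

```python
def site_corepower(nodes):
    """Weighted-average HEPScore per core across a heterogeneous farm.
    Each entry: (number_of_nodes, cores_per_node, hepscore_per_node)."""
    total_score = sum(n * score for n, cores, score in nodes)
    total_cores = sum(n * cores for n, cores, score in nodes)
    return total_score / total_cores

# Hypothetical farm of two node generations
farm = [
    (100, 64, 1024.0),   # 100 nodes, 64 cores, HEPScore 1024 -> 16.0/core
    (50, 128, 1664.0),   # 50 nodes, 128 cores, HEPScore 1664 -> 13.0/core
]
print(site_corepower(farm))  # 14.5
```

Comparing this published figure with a runtime value would mean recomputing the same average weighted by the cores actually delivering wall-clock time in production, rather than by installed capacity.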
The Nordugrid Advanced Resource Connector Middleware (ARC) will manifest itself as ARC 7 this spring, after a long release preparation process. ARC 7 represents a significant advancement in the evolution of the Advanced Resource Connector Middleware, building upon elements introduced in the ARC 6 release from 2019, and refined over the subsequent years.
This new version consolidates...
MTCA starterkits next step evolution
In this presentation, you will learn more about the powerBridge starterkits. The starterkits from powerBridge include MTCA.0 Rev. 3 changes as well as exciting new products, including payload cards, and are available in different sizes and flavours. They allow an easy jumpstart for new MTCA users.
More than 10,000 Windows devices are managed by the Windows team and delegated administrators at CERN. Ranging from workstations on which scientists run heavy simulation software, to security-hardened desktops in the administrative sector and Windows Servers that manage some of the most critical systems in the Organisation – today these systems are managed using a unified MDM solution named...
The operation of the Large Hadron Collider (LHC) is critically dependent on several hundred Front-End Computers (FECs) that manage all facets of its internals. These custom systems could not be upgraded during the second long shutdown (LS2), and with the coinciding end of life of EL7 on 30.06.2024, this posed a significant challenge to the successful operation of Run 3.
This presentation...
As CERN prepares for the third Long Shutdown (LS3), its evolving Linux strategy is critical to maintaining the performance and reliability of its infrastructure. This presentation will outline CERN’s roadmap for Linux leading up to LS3, highlighting the rollout of RHEL and AlmaLinux 10 to ensure stability and adaptability within the Red Hat ecosystem. In parallel, we will discuss efforts to...
This talk provides an overview of SUSE’s open-source solutions for modern data centers. We will discuss how SUSE technologies support various workloads while leveraging open-source flexibility and security.
Topics include:
- openSUSE Linux – A secure and open Linux system designed for
high-performance workloads.
- Harvester Project – An open-source alternative for virtualization,
...
The HEPiX IPv6 Working Group has been encouraging the deployment of IPv6 in WLCG and elsewhere for many years. At the last HEPiX meeting in November 2024 we reported on the status of our GGUS ticket campaign for WLCG sites to deploy dual-stack computing elements and worker nodes. Work on this has continued. We have also continued to monitor the use of IPv4 and IPv6 on the LHCOPN, with the aim...
The high-energy physics community, along with the WLCG sites and Research and Education (R&E) networks have been collaborating on network technology development, prototyping and implementation via the Research Networking Technical working group (RNTWG) since early 2020. The group is focused on three main areas: Network visibility, network optimization and network control and management....
The WLCG Network Throughput Working Group along with its collaborators in OSG, R&E networks and the perfSONAR team have collaboratively operated, managed and evolved a network measurement platform based upon the deployment of perfSONAR toolkits at WLCG sites worldwide.
This talk will focus on the status of the joint WLCG and IRIS-HEP/OSG-LHC infrastructure, including the resiliency and...
The Single Sign-On (SSO) service at CERN has undergone a significant evolution over recent years, transitioning from a Puppet-hosted solution to a Kubernetes-based infrastructure. Since September 2023, the current team has focused on cementing SSO as a stable and reliable cornerstone of CERN's IT services. Effort was concentrated on implementing best practices in service management - a mid...
This talk introduces the network architecture design of HEPS, including the general network, the production network, and the data center network.
The operational status of all network components will also be described.
The tenth European HTCondor workshop took place at Nikhef in Amsterdam last autumn and, as always, covered most if not all aspects of up-to-date high-throughput computing.
Here is a short summary of the parts of general interest, if you like :)
In the realm of High Throughput Computing (HTC), managing and processing large volumes of accounting data across diverse environments and use cases presents significant challenges. AUDITOR addresses this issue by providing a flexible framework for building accounting pipelines that can adapt to a wide range of needs.
At its core, AUDITOR serves as a centralized storage solution for...
In recent years, GPUs have become increasingly interesting for particle physics. Therefore, GridKa provides some GPU machines to the Grid and to the particle physics institute at KIT.
Since GPU usage and provisioning differ from CPUs, some development on the provider and user side is necessary.
The provided GPUs allow the HEP community to use GPUs in the Grid environment and develop solutions for...
At Nikhef, we've based much of our "fairness" policy implementation around user, group, and job-class (e.g. queue) "caps", that is, upper limits on the number of simultaneous jobs (or used cores). One of the main use cases for such caps is to prevent one or two users from acquiring the whole cluster for days at a time, blocking all other usage.
When we started using HTCondor, there...
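To make the cap idea concrete, here is a minimal admission check (an illustration of the concept only, not HTCondor's or Nikhef's actual mechanism; all names and cap values are made up):

```python
def may_start(user, group, running, caps):
    """A new job for `user` in `group` may start only if neither the
    per-user nor the per-group cap on simultaneously running jobs
    would be exceeded. `running` maps (user, group) -> job count."""
    user_running = sum(n for (u, g), n in running.items() if u == user)
    group_running = sum(n for (u, g), n in running.items() if g == group)
    return (user_running < caps.get(("user", user), caps["default_user"])
            and group_running < caps.get(("group", group), caps["default_group"]))

caps = {"default_user": 2, "default_group": 3, ("user", "alice"): 1}
running = {("alice", "atlas"): 1, ("bob", "atlas"): 2}
print(may_start("alice", "atlas", running, caps))  # False: alice is at her cap
print(may_start("carol", "atlas", running, caps))  # False: group at its cap of 3
```

A real scheduler evaluates such limits continuously and counts cores rather than jobs, but the blocking behaviour the abstract describes reduces to checks of this shape.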
Developments in microprocessor technology have confirmed the trend towards higher core-counts and decreased amount of memory per core, resulting in major improvements in power efficiency for a given level of performance. Per node core-counts have increased significantly over the past five years for the x86_64 architecture, which is dominating in the LHC computing environment, and the higher...
Many efforts have tried to combine the HPC and QC fields, proposing integrations between quantum computers and traditional clusters. Despite these efforts, the problem is far from solved, as quantum computers face a continuous evolution. Moreover, nowadays, quantum computers are scarce compared to the traditional resources in the HPC clusters: managing the access from the HPC nodes is...
Since its launch in 2011, [CC-IN2P3's computer history museum][1] has been visited by 13,000 people. It is home to more than 1000 artefacts, among which are [France's first web server][2] and a mysterious French micro-computer called the [CHADAC][3].
We will demonstrate through our experience and several examples that physical and digital preservation of IT infrastructure components while...
With an increasing focus on green computing, and with the high luminosity LHC fast approaching, we need every bit of extra throughput that we can get. In this talk, I'll be exploring my old ATLAS analysis code, as an example of how improvements to end-user code can significantly improve performance. Not only does this result in a more efficient utilisation of the available resources, it also...
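A typical example of such an end-user improvement (a generic sketch of the pattern, not the actual ATLAS code from the talk) is replacing a per-event Python loop with a vectorized NumPy expression; the cut values and variable names below are placeholders:

```python
import numpy as np

def select_loop(pt, eta, pt_min=25.0, eta_max=2.5):
    """Per-event selection written as an interpreted Python loop."""
    out = []
    for i in range(len(pt)):
        if pt[i] > pt_min and abs(eta[i]) < eta_max:
            out.append(i)
    return out

def select_vectorized(pt, eta, pt_min=25.0, eta_max=2.5):
    """Same selection as a single pass in compiled NumPy code,
    typically orders of magnitude faster on large arrays."""
    return np.nonzero((pt > pt_min) & (np.abs(eta) < eta_max))[0]

rng = np.random.default_rng(42)
pt = rng.exponential(30.0, 1_000_000)
eta = rng.uniform(-5.0, 5.0, 1_000_000)
# Both implementations select the same events
assert select_loop(pt[:1000], eta[:1000]) == list(select_vectorized(pt[:1000], eta[:1000]))
```

The speedup comes from moving the inner loop from the interpreter into compiled array kernels, which also tends to improve memory locality and hence energy per event.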
CC-IN2P3 provides storage and computing resources to around 2,700 users. Other services reach an even larger community, such as GitLab and its 10,000 users. It is therefore vital for CC-IN2P3 to provide accurate user documentation.
In this presentation, we'll give experience feedback from five years of managing the CC-IN2P3 user documentation. We will begin by outlining the reasons...
RAL makes use of the XRootD Cluster Management System to manage our XRootD server frontends for disk-based storage (ECHO).
In this session, I'll give an overview of our configuration, the custom scripts used, and observations on its interaction with different setups.
JUNO is an international collaborative neutrino experiment located in Kaiping City, southern China. The JUNO experiment employs a WLCG-based distributed computing system for official data production. The JUNO distributed computing sites are from China, Italy, France, and Russia. To monitor the operational status of the distributed computing sites and other distributed computing services, as...
At KIT we operate more than 800 hosts to run the Large Scale Data Facility (LSDF) and the WLCG Tier-1 center GridKa. Our configuration management efforts aim for reliable, consistent and reproducible host deployment, which allows for unattended mass deployment of stateless machines like the GridKa compute farm. In addition, our approach supports efficient patch management to tackle security...
The IHEP computing platform faces new requirements in data analysis, including limited access to login nodes, increasing demand for code debugging tools, and efficient data access for collaborative workflows. We have developed an Interactive aNalysis workbench (INK), a web-based platform leveraging the HTCondor cluster. This platform transforms traditional batch-processing resources into a...
Grafana dashboards are easy to make but hard to maintain. Since changes can be made easily, the questions that remain are: how to avoid changes that overwrite other work? How to keep track of changes? And how to communicate these to users? Another question that pops up frequently is how to apply certain changes consistently across multiple visualizations and dashboards. One partial solution is...
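One common approach to the consistency problem is to keep the dashboard JSON models in version control and apply bulk changes with a script rather than by hand. A minimal sketch (the dashboard model below is heavily simplified; the real Grafana schema has many more fields, and `set_datasource` is a hypothetical helper, not part of any Grafana API):

```python
import copy

def set_datasource(dashboard, new_ds):
    """Return a copy of a (simplified) Grafana dashboard JSON model
    with the datasource of every panel set to `new_ds` -- one way to
    apply a change consistently instead of editing panels one by one."""
    dash = copy.deepcopy(dashboard)
    for panel in dash.get("panels", []):
        panel["datasource"] = new_ds
    return dash

dash = {"title": "Node metrics",
        "panels": [{"title": "CPU", "datasource": "old"},
                   {"title": "Memory", "datasource": "old"}]}
updated = set_datasource(dash, "prometheus-prod")
print([p["datasource"] for p in updated["panels"]])  # ['prometheus-prod', 'prometheus-prod']
```

Because the transformation is a pure function over the JSON model, the same edit can be replayed over every dashboard file in a repository, reviewed as a diff, and tracked like any other code change.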
A High-Performance Computing (HPC) center typically consists of various domains. From the physical world (hardware, power supplies, etc.) up to highly abstracted and virtualized, dynamic execution environments (cloud infrastructures, software, and service dependencies, central services, etc.). The tools used to manage those different domains are as heterogeneous as the domains themselves....
For its operations, CERN depends on an extensive range of applications, achievable only through the use of diverse technologies, including more than one relational database management system (RDBMS). This presentation provides an overview of CERN’s Microsoft SQL Server (MSSQL) infrastructure, highlighting how we manage servers and design solutions for large-scale databases with different...
The Port d'Informació Científica (PIC) provides advanced data analysis services to a diverse range of scientific communities.
This talk will detail the status and evolution of PIC's Big Data Analysis Facility, centered around its Hadoop platform. We will describe the architecture of the Hadoop cluster and the services running on top, including CosmoHub, a web application that exemplifies...
Developing and managing computing systems is complex due to rapidly changing technology, evolving requirements during development, and ongoing maintenance throughout their lifespan. Significant post-deployment maintenance includes troubleshooting, patching, updating, and modifying components to meet new features or security needs. Investigating unusual events may involve reviewing system...
A key stepping stone in promoting diversity and accessibility at CERN consists in providing users with subtitles for all CERN-produced multimedia content. Subtitles not only enhance accessibility for individuals with impairments and non-native speakers but also make what would otherwise be opaque content fully searchable. The “Transcription and Translation as a Service” (TTaaS) project [1]...