Conveners
Track 7 – Facilities, Clouds and Containers: Monitoring and benchmarking
- Oksana Shadura (University of Nebraska Lincoln (US))
Track 7 – Facilities, Clouds and Containers: Cloud computing
- Christoph Wissing (Deutsches Elektronen-Synchrotron (DE))
Track 7 – Facilities, Clouds and Containers: Trends and new approaches
- Oksana Shadura (University of Nebraska Lincoln (US))
Track 7 – Facilities, Clouds and Containers: Infrastructure
- Sang Un Ahn (Korea Institute of Science & Technology Information (KR))
Track 7 – Facilities, Clouds and Containers: Containers
- Sang Un Ahn (Korea Institute of Science & Technology Information (KR))
Track 7 – Facilities, Clouds and Containers: Non-LHC experiments
- Christoph Wissing (Deutsches Elektronen-Synchrotron (DE))
Track 7 – Facilities, Clouds and Containers: Network technologies
- Oksana Shadura (University of Nebraska Lincoln (US))
- Sang Un Ahn (Korea Institute of Science & Technology Information (KR))
Track 7 – Facilities, Clouds and Containers: Opportunistic resources
- Christoph Wissing (Deutsches Elektronen-Synchrotron (DE))
The IHEP local cluster is a middle-sized HEP data center consisting of 20’000 CPU slots, hundreds of data servers, 20 PB of disk storage and 10 PB of tape storage. Once the JUNO and LHAASO experiments are taking data, the data volume processed at this center will approach 10 PB per year. At the current cluster scale, anomaly detection is a non-trivial task in daily maintenance. Traditional...
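As a minimal sketch of the kind of metric-based anomaly detection such a cluster might apply, the snippet below flags suspicious worker nodes with scikit-learn's IsolationForest; the metric names, data and thresholds are illustrative assumptions, not the IHEP implementation.

```python
# Illustrative sketch only: flag anomalous worker nodes from simple load metrics.
# Metric layout and values are assumptions, not the IHEP production setup.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Pretend each row is a worker node: [cpu_load, memory_used_frac, io_wait_frac]
normal = rng.normal(loc=[0.6, 0.5, 0.05], scale=[0.1, 0.1, 0.02], size=(500, 3))
faulty = rng.normal(loc=[0.99, 0.95, 0.40], scale=[0.01, 0.02, 0.05], size=(5, 3))
metrics = np.vstack([normal, faulty])

model = IsolationForest(contamination=0.01, random_state=0).fit(metrics)
labels = model.predict(metrics)          # -1 marks a suspected anomaly
print("suspected anomalous nodes:", np.where(labels == -1)[0])
```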
A Grid computing site consists of various services, including Grid middleware such as the Computing Element and the Storage Element. Ensuring safe and stable operation of these services is a key role of site administrators. Logs produced by the services provide useful information for understanding the status of the site. However, it is a time-consuming task for site administrators to monitor...
The benchmarking and accounting of CPU resources in WLCG has been based on the HEP-SPEC06 (HS06) suite for over a decade. HS06 is stable, accurate and reproducible, but it is an old benchmark and it is becoming clear that its performance and that of typical HEP applications have started to diverge. After evaluating several alternatives for the replacement of HS06, the HEPiX benchmarking WG has...
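To illustrate the general idea of a workload-based benchmark (measure the throughput of representative work and normalise it to a reference machine), here is a toy sketch; it is not HS06 nor its successor, and the workload and reference value are invented for illustration.

```python
# Toy illustration of a workload-based benchmark score: run a fixed amount of
# "work" and normalise the measured event throughput to a reference machine.
import hashlib
import time

def simulated_event(seed: int) -> bytes:
    """Stand-in for processing one event (pure CPU work)."""
    data = seed.to_bytes(8, "little")
    for _ in range(2000):
        data = hashlib.sha256(data).digest()
    return data

N_EVENTS = 2000
start = time.perf_counter()
for i in range(N_EVENTS):
    simulated_event(i)
elapsed = time.perf_counter() - start

throughput = N_EVENTS / elapsed               # events per second on this host
REFERENCE_THROUGHPUT = 500.0                  # assumed score-1.0 reference machine
print(f"score = {throughput / REFERENCE_THROUGHPUT:.2f}")
```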
WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of any network issues, including connection failures, congestion and traffic routing. The OSG Networking Area, in partnership with WLCG, is focused on being the primary source of networking information for its partners and...
Monitoring of the CERN Data Centres and the WLCG infrastructure is now largely based on the MONIT infrastructure provided by CERN IT. This is the result of the migration from several old in-house developed monitoring tools into a common monitoring infrastructure based on open source technologies such as Collectd, Flume, Kafka, Spark, InfluxDB, Grafana and others. The MONIT infrastructure...
The Centralised Elasticsearch Service at CERN runs the infrastructure that
provides Elasticsearch clusters for more than 100 different use cases.
This contribution presents how the infrastructure is managed, covering
resource distribution, instance creation, cluster monitoring and user
support, and describes the components that have been identified as
critical in order...
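One small aspect of such cluster monitoring can be sketched with the official Elasticsearch Python client by polling cluster health; the endpoint and the alerting condition below are illustrative assumptions, not the CERN service configuration.

```python
# Minimal sketch of health polling with the official Elasticsearch client;
# the endpoint and thresholds are assumptions for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es-cluster.example.cern.ch:9200")  # hypothetical endpoint

health = es.cluster.health()
if health["status"] != "green" or health["unassigned_shards"] > 0:
    print(f"cluster needs attention: status={health['status']}, "
          f"unassigned_shards={health['unassigned_shards']}")
else:
    print("cluster healthy")
```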
The EGI Cloud Compute service offers a multi-cloud IaaS federation that brings together research clouds as a scalable computing platform for research, accessible with OpenID Connect Federated Identity. The federation is not limited to single sign-on; it also introduces features that facilitate the portability of applications across providers: i) a common VM image catalogue with VM image replication to...
The cloudscheduler VM provisioning service has been running production jobs for ATLAS and Belle II for many years, using commercial and private clouds in Europe, North America and Australia. Initially released in 2009, version 1 is a single Python 2 module that uses multiple threads to poll resources and jobs and to create and destroy virtual machines. The code is difficult to scale,...
Cloud Services for Synchronization and Sharing (CS3) have become increasingly popular in the European Education and Research landscape in recent
years. Services such as CERNBox, SWITCHdrive, CloudStor and many more have become indispensable in the everyday work of scientists, engineers and administrators.
CS3 services represent an important part of the EFSS market segment (Enterprise File...
The use of commercial cloud services has gained popularity in research environments. Not only is it a flexible solution for adapting computing capacity to researchers' needs, it also provides access to the newest functionalities on the market. In addition, most service providers offer cloud credits, enabling researchers to explore innovative architectures before procuring them at scale....
In the last couple of years, we have been actively developing the Dynamic On-Demand Analysis Service (DODAS) as an enabling technology to deploy container-based clusters over any Cloud infrastructure with almost zero effort. The DODAS engine is driven by high-level templates written in the TOSCA language, which abstracts away the complexity of many configuration details. DODAS is...
Cloud computing is becoming mainstream, with funding agencies moving beyond prototyping and starting to fund production campaigns, too. An important aspect of any production computing campaign is data movement, both incoming and outgoing. While the performance and cost of VMs are relatively well understood, network performance and cost are not.
We thus embarked on a network...
The traditional HEP analysis model uses successive processing steps to reduce the initial dataset to a size that permits real-time analysis. This iterative approach requires significant CPU time and storage of large intermediate datasets and may take weeks or months to complete. Low-latency, query-based analysis strategies are being developed to enable real-time analysis of primary datasets by...
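As a hint of what a query-style, low-latency selection over a columnar dataset can look like (in contrast to iterative skimming), here is a minimal sketch with uproot; the file, tree and branch names are assumptions, and the branches are taken to be per-event scalars.

```python
# Sketch of a query-style selection on a columnar dataset with uproot,
# avoiding intermediate skims; file, tree and branch names are assumed.
import uproot

tree = uproot.open("events.root")["Events"]                  # hypothetical file/tree
arrays = tree.arrays(["met_pt", "n_jets"], library="np")     # assumed scalar branches

selection = (arrays["met_pt"] > 100.0) & (arrays["n_jets"] >= 2)
print("selected events:", selection.sum(), "of", len(selection))
```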
Driven by the need to carefully plan and optimise the resources for the next data taking periods of Big Science projects, such as CERN’s Large Hadron Collider and others, sites started a common activity, the HEPiX Technology Watch Working Group, tasked with tracking the evolution of technologies and markets of concern to the data centres. The talk will give an overview of general and...
At the SDCC we are deploying a Jupyterhub infrastructure to enable
scientists from multiple disciplines to access our diverse compute and
storage resources. One major design goal was to avoid rolling out yet
another compute backend and leverage our pre-existing resources via our
batch systems (HTCondor and Slurm). Challenges faced include creating a
frontend that allows users to choose...
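A minimal sketch of how JupyterHub can dispatch user sessions to an existing Slurm cluster via the batchspawner package is shown below; the partition name, resource requests and script template are illustrative assumptions, not the SDCC configuration.

```python
# Sketch of a jupyterhub_config.py fragment using batchspawner with Slurm.
# Partition, memory and runtime values are assumptions for illustration.
import batchspawner  # provides SlurmSpawner

c = get_config()  # supplied by JupyterHub when the config file is loaded
c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"
c.SlurmSpawner.req_partition = "jupyter"        # hypothetical partition
c.SlurmSpawner.req_memory = "4G"
c.SlurmSpawner.req_runtime = "8:00:00"
c.SlurmSpawner.batch_script = """#!/bin/bash
#SBATCH --partition={partition}
#SBATCH --mem={memory}
#SBATCH --time={runtime}
{cmd}
"""
```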
The WLCG has over 170 sites and the number is expected to grow in the coming years. In order to support WLCG workloads, each site has to deploy and maintain several middleware packages and grid services. Setting up, maintaining and supporting the grid infrastructure at a site can be a demanding activity and often requires significant assistance from WLCG experts. Modern configuration...
One of the most costly factors in providing a global computing infrastructure such as the WLCG is the human effort in deployment, integration, and operation of the distributed services supporting collaborative computing, data sharing and delivery, and analysis of extreme scale datasets. Furthermore, the time required to roll out global software updates, introduce new service components, or...
We describe the software tool-set being implemented in the context of the NOTED [1] project to better exploit WAN bandwidth for Rucio and FTS data transfers, how it has been developed and the results obtained.
The first component is a generic data-transfer broker that interfaces with Rucio and FTS. It identifies data transfers for which network reconfiguration is both possible and...
The WLCG Web Proxy Auto Discovery (WPAD) service provides a convenient mechanism for jobs running anywhere on the WLCG to dynamically discover web proxy cache servers that are nearby. The web proxy caches are general purpose for a number of different http applications, but different applications have different usage characteristics and not all proxy caches are engineered to work with the...
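A rough sketch of how a job might use WPAD-style discovery to pick up a nearby web proxy is shown below; the discovery URL is an assumption for illustration, and a real client would evaluate the PAC file's FindProxyForURL() rather than scrape it.

```python
# Sketch of discovering a nearby proxy from a WPAD server and exporting it to
# HTTP clients; the WPAD URL is an assumed endpoint for illustration.
import os
import re
import urllib.request

WPAD_URL = "http://wlcg-wpad.cern.ch/wpad.dat"   # assumed discovery endpoint

pac = urllib.request.urlopen(WPAD_URL, timeout=10).read().decode()
# PAC files are JavaScript; here we simply pull the first "PROXY host:port" entry.
match = re.search(r"PROXY\s+([\w\.\-]+:\d+)", pac)
if match:
    os.environ["http_proxy"] = "http://" + match.group(1)
    print("using proxy", match.group(1))
```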
Within the ATLAS detector, the Trigger and Data Acquisition system is responsible for the online processing of data streamed from the detector during collisions at the Large Hadron Collider (LHC) at CERN. The online farm is composed of ~4000 servers processing the data read out from ~100 million detector channels through multiple trigger levels. The capability to monitor the ongoing data...
Computational science, data management and analysis have been key factors in the success of Brookhaven Lab's scientific programs at the Relativistic Heavy Ion Collider (RHIC), the National Synchrotron Light Source (NSLS-II), the Center for Functional Nanomaterials (CFN), and in biological, atmospheric, and energy systems science, Lattice Quantum Chromodynamics (LQCD) and Materials Science as...
Since 2013, CERN’s local data centre, combined with a colocation infrastructure at the Wigner data centre in Budapest, has been hosting the compute and storage capacity for the WLCG Tier-0. In this paper we will describe how we try to optimize and improve the operation of our local data centre to meet the anticipated increase in physics compute and storage requirements for Run 3, taking into...
The ATLAS Spanish Tier-1 and Tier-2s have more than 15 years of experience in the deployment and development of LHC computing components and their successful operations. The sites are already actively participating in, and even coordinating, emerging R&D computing activities developing the new computing models needed in the LHC Run3 and HL-LHC periods.
In this contribution, we present details...
DESY is one of the largest accelerator laboratories in Europe, developing and operating state-of-the-art accelerators used to perform fundamental science in the areas of high-energy physics, photon science and accelerator development.
While for decades high energy physics has been the most prominent user of the DESY compute, storage and network infrastructure, various scientific...
In recent years containerization has revolutionized cloud environments, providing a secure, lightweight, standardized way to package and execute software. Solutions such as Kubernetes enable orchestration of containers in a cluster, including for the purpose of job scheduling. Kubernetes is becoming a de facto standard, available at all major cloud computing providers, and is gaining increased...
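As a minimal sketch of the job-scheduling use mentioned here, the snippet below submits a batch-style workload as a Kubernetes Job with the official Python client; the image, namespace and command are illustrative choices, not a specific experiment's setup.

```python
# Minimal sketch: submit a batch-style workload as a Kubernetes Job.
# Image, namespace and command are assumptions for illustration.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="demo-job"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="payload",
                    image="python:3.11-slim",
                    command=["python", "-c", "print('hello from the cluster')"],
                )],
            )
        )
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```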
We will describe the deployment of containers on the ATLAS infrastructure. There are several ways to run containers: as part of the batch system infrastructure, as part of the pilot, or called directly. ATLAS is exploiting them depending on which facility its jobs are sent to. Containers have been a vital part of the HPC infrastructure for the past year, and using fat images - images...
Container technologies are rapidly becoming the preferred way for developers and system administrators to package applications, distribute software and run services. A crucial role is played by container orchestration software such as Kubernetes, which is also the natural fit for microservice-based architectures. Complex services are re-thought as a collection of fundamental applications (each...
The CERN Batch Service faces many challenges in order to get ready for the computing demands of future LHC runs. These challenges require that we look at all potential resources, assess how efficiently we use them, and explore different alternatives for exploiting opportunistic resources both within our infrastructure and outside of the CERN computing centre.
Several projects, like...
High Performance Computing (HPC) facilities provide vast computational power and storage, but generally work on fixed environments designed to address the most common software needs locally, making it challenging for users to bring their own software. To overcome this issue, most HPC facilities have added support for HPC friendly container technologies such as Shifter, Singularity, or...
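As a small sketch of what running a user-supplied software stack through such a container runtime can look like from a workflow script, the snippet below wraps a Singularity/Apptainer invocation; the image path, bind mount and payload are hypothetical.

```python
# Sketch of running a user-provided stack on an HPC node via Singularity;
# image path, bind mounts and payload command are assumptions.
import subprocess

cmd = [
    "singularity", "exec",
    "--bind", "/scratch:/scratch",          # expose site scratch space in the container
    "/cvmfs/unpacked.example/myimage.sif",  # hypothetical container image
    "python", "analysis.py",
]
subprocess.run(cmd, check=True)
```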
The new jAliEn (Java ALICE Environment) middleware is a Grid framework designed to satisfy the needs of the ALICE experiment for the LHC Run 3, such as providing a high-performance and high-scalability service to cope with the increased volumes of collected data. This new framework also introduces a split, two-layered job pilot, creating a new approach to how jobs are handled and executed...
The LHAASO (Large High Altitude Air Shower Observatory) experiment of IHEP is located in Daocheng, Sichuan province, at an altitude of 4410 m. The main scientific goals of LHAASO are to search for galactic cosmic-ray origins by extensive spectroscopic investigations of gamma-ray sources above 30 TeV. To accomplish these goals, LHAASO contains four detector arrays, which generate huge amounts...
VIRGO is an interferometer for the detection of Gravitational Waves at the European Gravitational Observatory in Italy. Along with the two LIGO interferometers in the US, VIRGO is being used to collect data from astrophysical sources such as compact binary coalescences, and is currently running its third observational period, collecting gravitational wave events at a rate of more than one per...
Experiments in Photon Science at DESY will, in the future, undergo significant changes in terms of data volumes and data rates and, most importantly, will need to fully enable online (synchronous to the experiment) data analysis. The primary goal is to support new types of experimental setups requiring significant computing effort to perform control and data-quality monitoring, allow effective data reduction and,...
Project 8 is applying a novel spectroscopy technique to make a precision measurement of the tritium beta-decay spectrum, resulting in either a measurement of or further constraint on the effective mass of the electron antineutrino. ADMX is operating an axion haloscope to scan the mass-coupling parameter space in search of dark matter axions. Both collaborations are executing medium-scale...
IceCube sends out real-time alerts for neutrino events to other multi-messenger observatories around the world, including LIGO/VIRGO and electromagnetic observatories. The typical case is to send out an initial alert within one minute, then run more expensive processing to refine the direction and energy estimates and send a follow-on message. This second message has averaged 40 to 60...
This paper describes the work done by the AENEAS project to develop a concept and design for a distributed, federated, European SKA Regional Centre (ESRC) to support the compute, storage, and networking that will be required to achieve the scientific goals of the Square Kilometre Array (SKA).
The AENEAS (Advanced European Network of E-infrastructures for Astronomy with the SKA) project is a 3...
Dynamic resource provisioning in the WLCG is commonly based on meta-scheduling and the pilot model. For a given set of workflows, a meta-scheduler computes the ideal set of resources; so-called pilot jobs integrate these resources into an overlay batch system, which then processes the initial workflows. While offering a high level of control and precision, the strong coupling between...
Whether you consider “IoT” as a real thing or a buzzword, there’s no doubt that connected devices, data analysis and automation are transforming industry. CERN is no exception: a network of LoRa-based radiation monitors has recently been deployed and there is a growing interest in the advantages connected devices could bring—to accelerator operations just as much as to building management.
...
The Tokyo regional analysis center at the International Center for Elementary Particle Physics, the University of Tokyo, is one of the Tier 2 sites for the ATLAS experiment in the Worldwide LHC Computing Grid (WLCG). The current system provides 7,680 CPU cores and 10.56 PB of disk storage for WLCG. CERN plans the High-Luminosity LHC to start from 2026, which will increase the peak luminosity to 5...
This talk explores the methods and results confirming the baseline assumption that LHCONE traffic is science traffic. The LHCONE (LHC Open Network Environment) is a network conceived to support globally distributed collaborative science. The LHCONE connects thousands of researchers to LHC data sets at hundreds of universities and labs performing analysis within the global collaboration. It is...
Increased operational effectiveness and the dynamic integration of only temporarily available compute resources (opportunistic resources) will become more and more important in the next decade, due to the scarcity of resources for future high-energy physics experiments as well as the desired integration of cloud and high-performance computing resources. This results in a more heterogeneous compute...
The use of IPv6 on the general internet continues to grow. Several Broadband/Mobile-phone companies, such as T-Mobile in the USA and BT/EE in the UK, now use IPv6-only networking with connectivity to the IPv4 legacy world enabled by the use of NAT64/DNS64/464XLAT. Large companies, such as Facebook, use IPv6-only networking within their internal networks, there being good management and...
DHCP is an often overlooked but incredibly important component of the operation of every data center. With constantly scaling and dynamic environments, managing DHCP servers that rely on configuration files, which must be kept in sync, becomes both slow and expensive in engineering effort. The LHCb Online infrastructure currently consists of over 2500 DHCP-enabled devices, physical and virtual...
The Jefferson Lab 12 GeV accelerator upgrade completed in 2015 is now producing data at volumes unprecedented for the lab. The resources required to process these data now exceed the capacity of the onsite farm, necessitating the use of offsite computing resources for the first time in the history of JLab. GlueX is now utilizing NERSC for raw data production using the new SWIF2 workflow tool...
High Energy Physics (HEP) experiments have greatly benefited from a strong relationship with Research and Education (REN) network providers and, thanks to projects such as LHCOPN/LHCONE and REN contributions, have enjoyed significant capacities and high-performance networks for some time. Network providers have been able to continually expand their capacities to over-provision the networks...
Access to both High Throughput Computing (HTC) and High Performance Computing (HPC) facilities is vitally important to the fusion community, not only for plasma modelling but also for advanced engineering and design, materials research, rendering, uncertainty quantification and advanced data analytics for engineering operations. The computing requirements are expected to increase as...
The Simulation at Point1 (Sim@P1) project was built in 2013 to take advantage of the ATLAS Trigger and Data Acquisition High Level Trigger (HLT) farm. The HLT farm provides around 100,000 cores, which are critical to ATLAS during data taking. When ATLAS is not recording data, this large compute resource is used to generate and process simulation data for the experiment. At the beginning of the...
Belle II has started Phase 3 data taking with a fully equipped detector. The data flow at the maximum luminosity is expected to be 12 PB of data per year and will be analysed by a cutting-edge computing infrastructure spread over 26 countries. Some of the major computing centres for HEP in Europe, the USA and Canada will store and handle the second copy of the RAW data.
In this scenario, the...