Conveners
T8 - Networks and facilities: S2
- Sang Un Ahn (Korea Institute of Science & Technology Information (KR))
- Jose Flix Molina (Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas)
T8 - Networks and facilities: S4
- Wei Yang (SLAC National Accelerator Laboratory (US))
- Jose Flix Molina (Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas)
T8 - Networks and facilities: S6
- Oksana Shadura (University of Nebraska Lincoln (US))
- Jose Flix Molina (Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas)
WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of any network issues, including connection failures, congestion and traffic routing. The OSG Networking Area, in partnership with WLCG, has focused on collecting, storing and making available all network-related metrics for further...
The fraction of general internet traffic carried over IPv6 continues to grow rapidly. The transition of WLCG central and storage services to dual-stack IPv4/IPv6 is progressing well, thus enabling the use of IPv6-only CPU resources as agreed by the WLCG Management Board and presented by us at CHEP2016. By April 2018, all WLCG Tier 1 data centres will provide access to their services over IPv6....
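An illustrative aside (not part of the abstract above): the short Python sketch below probes whether a given service endpoint answers over both IPv4 and IPv6, the kind of dual-stack check the transition implies. The hostname and port are hypothetical placeholders.

    import socket

    def probe(host, port, family):
        """Try a TCP connection over the given address family; return True on success."""
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror:
            return False  # no address record published for this family
        for af, socktype, proto, _, sockaddr in infos:
            try:
                with socket.socket(af, socktype, proto) as s:
                    s.settimeout(5)
                    s.connect(sockaddr)
                    return True
            except OSError:
                continue
        return False

    host, port = "se.example.org", 1094  # hypothetical storage endpoint
    print("IPv4 reachable:", probe(host, port, socket.AF_INET))
    print("IPv6 reachable:", probe(host, port, socket.AF_INET6))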
Data-intensive science collaborations still face challenges when transferring large data sets between globally distributed endpoints. Many issues need to be addressed to orchestrate network resources and better exploit the available infrastructure. In multi-domain scenarios, the complexity increases because network operators rarely export the network topology to researchers and...
Recent years have seen the mass adoption of streaming in mobile computing, an increase in the size and frequency of bulk long-haul data transfers in science in general, and the use of big data sets in job processing that demands real-time long-haul access, which can be greatly affected by variations in latency. It has been shown in the physics and climate research communities that the need to...
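To make the latency sensitivity mentioned above concrete, the sketch below evaluates the Mathis et al. approximation for steady-state TCP throughput, throughput ≈ (MSS / RTT) * 1/sqrt(p); the segment size, round-trip times and loss rate are illustrative values, not figures from the abstract.

    import math

    def mathis_throughput(mss_bytes, rtt_s, loss_rate):
        """Approximate achievable single-stream TCP throughput in bits per second."""
        return (mss_bytes * 8 / rtt_s) * (1.0 / math.sqrt(loss_rate))

    mss = 1460                     # typical Ethernet MSS in bytes
    for rtt_ms in (10, 100, 200):  # regional vs. transatlantic vs. transpacific RTTs
        gbps = mathis_throughput(mss, rtt_ms / 1000.0, 1e-6) / 1e9
        print(f"RTT {rtt_ms:3d} ms -> ~{gbps:.2f} Gbit/s per TCP stream")

Even at a modest loss rate, a tenfold increase in round-trip time costs a tenfold drop in per-stream throughput, which is why long-haul transfers rely on parallel streams and careful end-host tuning.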
Networking is foundational to the ATLAS distributed infrastructure, and there are many ongoing networking activities both within and outside of ATLAS. We will report on progress in a number of areas, exploring ATLAS's use of networking and our ability to monitor the network, analyze metrics from the network, and tune and optimize application and end-host parameters to make the...
The First-level Event Selector (FLES) is the main event selection system of the upcoming CBM experiment at the future FAIR facility in Germany. As the central element, a high-performance compute cluster analyses free-streaming, time-stamped data delivered from the detector systems at rates exceeding 1 TByte/s and selects data for permanent storage. While the detector systems are located in a...
Network performance is key to the correct operation of any modern datacentre infrastructure or data acquisition (DAQ) system. Hence, it is crucial to ensure that the devices employed in the network are carefully selected to meet the requirements.
The established benchmarking methodology [1,2] consists of various tests that create perfectly reproducible traffic patterns. This has the advantage of...
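As a loose illustration only (not the methodology of [1,2]): one way to produce a perfectly reproducible traffic pattern is to send fixed-size UDP datagrams at a constant rate towards a device under test, as in the Python sketch below. The destination address, datagram size and rate are hypothetical.

    import socket
    import time

    def constant_rate_udp(dst=("192.0.2.10", 9000), size=1472, pps=1000, duration=1.0):
        """Send `pps` datagrams of `size` bytes per second for `duration` seconds."""
        payload = b"\x00" * size
        interval = 1.0 / pps
        sent = 0
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            t_next = time.perf_counter()
            t_end = t_next + duration
            while time.perf_counter() < t_end:
                s.sendto(payload, dst)
                sent += 1
                t_next += interval
                delay = t_next - time.perf_counter()
                if delay > 0:
                    time.sleep(delay)
        return sent

    print("datagrams sent:", constant_rate_udp())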
We provide the KEK general-purpose network to support various kinds of research activities in the fields of high-energy physics, material physics, and accelerator physics. Since the end of the 20th century, cyber attacks against the network have occurred on an almost daily basis, and attack techniques change rapidly and drastically. In such circumstances, we constantly face difficult trade-offs and are required...
Benchmarking is a consolidated activity in High Energy Physics (HEP) computing, where large computing power is needed to support scientific workloads. In HEP, great attention is paid to the speed of the CPU in accomplishing high-throughput tasks characterised by a mixture of integer and floating-point operations and a memory footprint of a few gigabytes.
Since 2009, HEP-SPEC06 (HS06) has been the...
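A minimal sketch of how a composite score of this kind is typically aggregated, assuming the usual SPEC-style geometric mean of per-benchmark ratios against a reference machine; the ratios below are made-up numbers, not real HS06 measurements.

    import math

    def geometric_mean(ratios):
        """Geometric mean of a list of per-benchmark speed ratios."""
        return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

    example_ratios = [14.2, 15.8, 13.1, 16.4, 12.9, 15.0, 14.7]  # fabricated ratios
    print(f"composite score: {geometric_mean(example_ratios):.1f}")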
Based on the observation of low average CPU utilisation of several hundred disk servers in the EOS storage system at CERN, the Batch on EOS Extra Resources (BEER) project developed an approach to utilise these resources for batch processing. After initial proof-of-concept tests, which showed almost no interference between the batch and storage services, a model for production has been developed and...
The new unified monitoring (MONIT) for the CERN Data Centres and for the WLCG Infrastructure is now based on established open source technologies for the collection, streaming and storage of monitoring data. The previous solutions, based on in-house development and commercial software, are being replaced with widely recognized technologies such as Collectd, Flume, Kafka, ElasticSearch, InfluxDB,...
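A minimal sketch of one stage of such a pipeline, consuming monitoring records from a Kafka topic with the kafka-python client; the broker address, topic name and record fields are hypothetical, not the MONIT production endpoints.

    import json

    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(
        "monit.metrics.example",                       # hypothetical topic
        bootstrap_servers=["kafka.example.org:9092"],  # hypothetical broker
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
        auto_offset_reset="latest",
    )

    for record in consumer:
        doc = record.value
        # a real pipeline would enrich the document and write it to
        # ElasticSearch or InfluxDB; here we only print it
        print(doc.get("metric"), doc.get("value"))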
CERN has been using ITIL Service Management methodologies and ServiceNow since early 2011. What began as a joint project between just the Information Technology and General Services Departments is now a common methodology and tool used by most of CERN, and all departments are represented, fully or partially, in the CERN Service Catalogue.
We will present a summary of the current situation...
In the CERN IT agile infrastructure, Puppet, the CERN IT central messaging infrastructure and the roger application are the key constituents handling the configuration of the machines in the computer centre. A machine's configuration at any given moment depends on its declared state in roger, and Puppet ensures the actual implementation of the desired configuration by running the puppet agent on...
Prometheus is a leading open source monitoring and alerting tool. Prometheus uses a pull model, in the sense that it pulls metrics from monitored entities rather than receiving them as a push. But this can be a major headache, even leaving security aside, when network gymnastics are needed to reach the monitored entities. Not only that, but sometimes system metrics might be...
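A minimal sketch of the pull model described above, using the standard prometheus_client library: the process exposes a metric over HTTP and the Prometheus server scrapes it at its own pace. The metric name, port and values are illustrative.

    import random
    import time

    from prometheus_client import Gauge, start_http_server  # pip install prometheus_client

    queue_depth = Gauge("example_queue_depth", "Hypothetical queue depth of a monitored service")

    if __name__ == "__main__":
        start_http_server(8000)  # Prometheus pulls from http://<host>:8000/metrics
        while True:
            queue_depth.set(random.randint(0, 100))  # stand-in for a real measurement
            time.sleep(15)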
With the explosion in the number of distributed applications, a new dynamic server environment has emerged that groups servers into clusters whose utilization depends on the current demand for the application.
To provide reliable and smooth services, it is crucial to detect and fix possible erratic behavior of individual servers in these clusters. The use of standard techniques for this purpose delivers...
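As one hedged example of such a standard technique: flag a server as erratic when its metric is a robust outlier with respect to the cluster median (median absolute deviation). The hostnames and load values below are made up.

    import statistics

    def erratic_servers(metric_by_host, threshold=3.5):
        """Return hosts whose metric is a robust outlier within the cluster."""
        values = list(metric_by_host.values())
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values) or 1e-9
        return [h for h, v in metric_by_host.items()
                if abs(v - med) / (1.4826 * mad) > threshold]

    load = {"node01": 0.62, "node02": 0.58, "node03": 0.61, "node04": 0.97}
    print(erratic_servers(load))  # -> ['node04']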
The Alpha Magnetic Spectrometer (AMS) is a high energy physics experiment installed and operating on board the International Space Station (ISS) since May 2011 and expected to last through 2024 and beyond. The Science Operation Centre is in charge of the offline computing for the AMS experiment, including flight data production, Monte-Carlo simulation, data management, data backup, etc....
The INFN Tier-1 center at CNAF was extended in 2016 and 2017 in order to include a small amount of resources (~24 kHS06, corresponding to ~10% of the CNAF pledges for LHC in 2017) physically located at the Bari-ReCas site (~600 km from CNAF).
In 2018, a significant fraction of the CPU power (~170 kHS06, equivalent to ~50% of the total CNAF pledges) is going to be provided via a...
Experience to date indicates that the demand for computing resources in high energy physics is highly dynamic, while the resources provided by the WLCG remain static over the year. It has become evident that opportunistic resources such as High Performance Computing (HPC) centers and commercial clouds are very well suited to cover peak loads. However, the utilization of this...
Two production clusters co-exist at the Institute of High Energy Physics (IHEP). One is a High Throughput Computing (HTC) cluster with HTCondor as the workload manager; the other is a High Performance Computing (HPC) cluster with SLURM as the workload manager. The resources of the HTCondor cluster are provided by multiple experiments, and the resource utilization has reached more...
While the LHCb experiment will be using a local data-centre at the experiment site for its computing infrastructure in Run3, LHCb is also evaluating the possibility of moving its High Level Trigger server farm into an IT data-centre located a few kilometres away from the LHCb detector. If proven feasible, and if it could be replicated by other LHC experiments, the solution would allow the...
The ALICE computing model for Run3 foresees a few big centres, called Analysis Facilities, optimised for fast processing of large local sets of Analysis Object Data (AODs). In contrast to the current running of analysis trains on the Grid, this will allow for more efficient execution of inherently I/O-bound jobs. GSI will host one of these centres and has therefore finalised a first Analysis...
Even as grid middleware and analysis software have matured over the course of the LHC's lifetime, it is still challenging for non-specialized computing centers to contribute resources. Many U.S. CMS collaborators would like to set up Tier-3 sites to contribute campus resources for the use of their local CMS group as well as the collaboration at large, but find the administrative burden of...