General description of the EOS service @CERN
Improving EOS monitoring of finished transfers, with a hands-on look at eos io stat output.
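As a taste of what the hands-on part covers, here is a minimal sketch in Go that shells out to `eos io stat -m` and folds each line into a map. The command is a real EOS CLI subcommand; the assumption that `-m` emits one record per line of space-separated key=value pairs follows the usual EOS monitoring format, and the field names are illustrative, not authoritative.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Assumption: "eos io stat -m" prints one record per line as
	// space-separated key=value pairs (EOS monitoring format).
	out, err := exec.Command("eos", "io", "stat", "-m").Output()
	if err != nil {
		fmt.Println("eos io stat failed:", err)
		return
	}
	sc := bufio.NewScanner(bytes.NewReader(out))
	for sc.Scan() {
		record := map[string]string{}
		for _, tok := range strings.Fields(sc.Text()) {
			if k, v, ok := strings.Cut(tok, "="); ok {
				record[k] = v // field names depend on the EOS version
			}
		}
		fmt.Println(record)
	}
}
```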
Prometheus is a modern, simple and scalable monitoring system with an easy-to-use query language based on labels. The EOS operators team has developed a fully functional EOS Prometheus exporter in Golang to monitor all EOS metrics, including space, group, node, filesystem, I/O and namespace stats collectors. In this talk, the tool will be showcased and made available to the EOS Community.
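For readers unfamiliar with the exporter pattern, the sketch below shows the basic shape in Go using prometheus/client_golang: register a labelled gauge, set it, and serve /metrics. The metric name, label names, values and port are illustrative assumptions, not the actual exporter's metrics.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric; a real collector would fill this from MGM queries.
var spaceUsedBytes = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "eos_space_used_bytes",
		Help: "Bytes used per EOS space.",
	},
	[]string{"instance", "space"},
)

func main() {
	prometheus.MustRegister(spaceUsedBytes)
	// Placeholder values for illustration only.
	spaceUsedBytes.WithLabelValues("eospublic", "default").Set(1.3e16)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9986", nil) // arbitrary port
}
```

Prometheus then scrapes the /metrics endpoint on the configured port and stores the labelled time series.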
Presentation on the new recording plug-in, which allows I/O sampling, and on the replay tool.
With 100GE technology and erasure coding we discovered new bottlenecks and challenges. This presentation will recap the state of the art of the ALICEO2 EOS instance and show benchmarks, including a real and a replayed physics analysis use case.
LHC Data Storage: RUN 3 Data Taking Commissioning
Update on the setup and operations at the Vienna Tier-2 site.
Fermilab has been running an EOS instance since testing began in June 2012. By May 2013, before it became production storage, 600 TB had been allocated to EOS. Today, approximately 13 PB of storage is available in the EOS instance.
An update on our current experiences and challenges running an EOS instance for use by the Fermilab LHC Physics Center (LPC) computing cluster. The LPC...
As part of its storage migration plan, the CMS Tier-2 center at Purdue University is preparing an EOS deployment of ~10PB, which will serve as the main Storage Element of the site, as well as a basis for the future Analysis Facility currently in development. We adopted a fully containerized approach with Kubernetes, which allows us to better share available hardware resources...
This is a brief presentation on the operational status of the Custodial Disk Storage (CDS) system provided for the ALICE experiment as a tape endpoint. The CDS is built on EOS, using its erasure-coding implementation (RAIN) for data protection. The CDS joined the WLCG Tape Challenges in the previous year, and about one PB of data has been transferred from the experiment. A...
In this communication, we present the deployment project of the EOS storage software solution at the GRIF site. GRIF is a distributed site made of four subsites at different locations in the Paris region. The worst-case network latency between the subsites is 2-4 ms, with three of them connected via 100G links. The objective is to consolidate the four (4)...
This talk introduces the GroupBalancer and what it does. We also cover the GroupBalancer improvements introduced from the 4.8.78 release onward, how to configure them for deployments, some figures from existing deployments, and what the roadmap for the future holds for these functionalities.
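As a hedged illustration of the configuration step, the sketch below applies GroupBalancer settings through the eos CLI from Go. `eos space config` is a real command, but the exact key names and values differ across releases, so every key shown here is an assumption to verify against the documentation for your EOS version.

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Assumed space-level keys; check your release's docs before use.
	settings := []string{
		"space.groupbalancer=on",
		"space.groupbalancer.min_threshold=60", // assumed key
		"space.groupbalancer.max_threshold=80", // assumed key
	}
	for _, s := range settings {
		cmd := exec.Command("eos", "space", "config", "default", s)
		if out, err := cmd.CombinedOutput(); err != nil {
			fmt.Printf("%s failed: %v\n%s", s, err, out)
		}
	}
}
```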
Migrating the AMS experiment data from EOSPUBLIC to EOSAMS02 stimulated development of tools which might be useful in general for similar exercises in the future. We will show the work in progress.
In preparation for Run-3 we faced the following problem: the usage of IO resources has to be balanced between individual activities, which led to the implementation of IO priorities and bandwidth-regulation policies. While commissioning the ALICEO2 EOS instance we observed that write performance using the buffer cache is a bottleneck on storage nodes. Direct IO helps to improve...
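To make the direct-IO point concrete, here is a minimal Linux sketch in Go (not the EOS FST code): O_DIRECT bypasses the page cache, which is why both the buffer address and the write size must be block-aligned. The path and the 4096-byte alignment are illustrative assumptions.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

const align = 4096

// alignedBlock returns an n-byte slice whose start address is
// align-byte aligned, as required by O_DIRECT.
func alignedBlock(n int) []byte {
	buf := make([]byte, n+align)
	off := align - int(uintptr(unsafe.Pointer(&buf[0]))%align)
	return buf[off : off+n]
}

func main() {
	// The target filesystem must support O_DIRECT (tmpfs does not).
	f, err := os.OpenFile("/var/tmp/direct.dat",
		os.O_WRONLY|os.O_CREATE|syscall.O_DIRECT, 0644)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close()
	block := alignedBlock(align) // write size is a multiple of 4096
	if _, err := f.Write(block); err != nil {
		fmt.Println("direct write failed:", err)
	}
}
```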
With XRootD 5 the on-the-wire protocol provides confidentiality of data inside the transport layer. However, data files remain human-readable on storage nodes and can be accessed and downloaded by any EOS administrator and any person with read access. Filesystem-level encryption on storage nodes does not solve this confidentiality problem.
To provide better data privacy the most recent versions...
Physics and CERNBOX instances at CERN are exposed to O(10^4) mount clients simultaneously. Overloads from batch access are not a new phenomenon: for years the AFS filesystem has suffered more or less frequent volume overloads. During overload episodes, meta-data access at the MGM slows down significantly because thousands of batch nodes compete against a few interactive clients and sync & share access. To...
A primer on xrdcp's new (and old) features like zip append, metalink support, retries and many more.
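A hedged usage sketch driving xrdcp from Go: `--retry` and `--zip` are xrdcp options in the XRootD 5 era, but verify them with `xrdcp --help` on your build; the hosts, paths and file names below are placeholders.

```go
package main

import (
	"fmt"
	"os/exec"
)

// run invokes xrdcp with the given arguments and reports failures.
func run(args ...string) {
	if out, err := exec.Command("xrdcp", args...).CombinedOutput(); err != nil {
		fmt.Printf("xrdcp %v failed: %v\n%s", args, err, out)
	}
}

func main() {
	// Retry a failed transfer up to 3 times (placeholder URL).
	run("--retry", "3",
		"root://eos.example.cern.ch//eos/demo/in.dat", "/tmp/in.dat")
	// Extract a single member file from a remote ZIP archive.
	run("--zip", "member.dat",
		"root://eos.example.cern.ch//eos/demo/archive.zip", "/tmp/member.dat")
}
```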
Context: productisation of a native Windows connection to EOS.
Objectives: a professional integration of EOS with the Windows platform should allow seamless usage of EOS as a Windows local disk with all the EOS benefits, such as low latency, high throughput, and high reliability.
Method: Implementation of the EOS client for the Windows...
The EOS durability machinery is a set of operator scripts, tools and EOS components to classify, monitor and repair unhealthy files. The EOS filesystem check (fsck) was enabled in 2021, but one should still keep track of each instance's state and investigate the root causes of the problems found.
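A small operator-side sketch of the kind of tracking described above: `eos fsck stat` and `eos fsck report` are real subcommands, though their output format changes between EOS versions, so this sketch only surfaces the raw text for an operator dashboard or cron job.

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	for _, sub := range []string{"stat", "report"} {
		out, err := exec.Command("eos", "fsck", sub).CombinedOutput()
		if err != nil {
			fmt.Printf("eos fsck %s failed: %v\n", sub, err)
			continue
		}
		fmt.Printf("== eos fsck %s ==\n%s", sub, out)
	}
}
```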
CERNBox is a key enabler service built on top of EOS for users at CERN and beyond. The service is used by more than 37K users and stores over 15PB of data, representing all the user communities at the laboratory.
In this talk we will explain the current status of the service, the challenges we faced in 2021 and our vision for the future: CERNBox as the gateway for a federation of...
EOS provides the backend to CERNBox, the cloud sync and share service implementation used at CERN. EOS for CERNBox stores 12PB of user and project space data across 9 different instances running in multi-fst configuration. This presentation will give an overview of the 2021 challenges, how we tried to address them, and the roadmap for the service in 2022.
More than 300 million CERNBox files are processed daily using the cback backup tool, which ensures that files are safely stored in a different geographical area and on a different storage backend. The backup tool has not stopped evolving and was extended to support CephFS mount backups alongside EOS mounts under the same infrastructure. This talk will present the current status of the project...
The CERNBox service is currently backed by 13PB of EOS storage distributed across more than 3,000 drives. EOS has proven to be a reliable and highly performing backend throughout. On the other hand, the CERN Storage Group also operates CephFS, which has been previously evaluated in combination with EOS as a potential solution for large scale physics data taking [1]. This work seeks to further...
To consolidate the concept of sharing implemented inside EOS for any access protocol, we are currently adding a new type of ACL which defines a 'share'. One of the new characteristics of a share ACL is that it is not influenced by POSIX or classic ACLs. We support additional ACL capabilities such as 'can share'.
A second important new concept is ownership by an EGROUP. Ownership...
EOS provides a very detailed log system with useful information about all the user and system operations performed at any time. Each EOS daemon has its own log file, and tracing operations that involve different components can be a time-consuming task (MGM -> FST1 -> FST2). With Grafana Loki and Promtail, we set up a log-aggregation system that allows tracing operations...
In this talk we present the evolution of the CERNBox Samba service that we operate in front of EOS. An important recent change is the adoption of a new layout based on bind mounts: this allows us to operate a smaller number of EOS mounts and to federate multiple EOS instances in a single namespace. We will discuss further measures adopted to address the ever-increasing load from the...
Understanding the configuration and logic used by eosxd on /eos/ is not straightforward, in particular in containerized environments. This short presentation tries to explain the basics.
This contribution illustrates how we have evolved file locking in CERNBox and EOS. Initially introduced to support Office online applications, the functionality has been extended to be an integral part of Reva, the engine powering CERNBox. We will describe the implementation in the EOS storage system, and the foreseen extensions to cover Linux file locks (flocks) as supported for FUSE and...
In this presentation, we will report on how we at AARNet deployed CTA along with the restic backup client as a backup/archive solution for our production EOS clusters. The solution has been in production since late 2021. This presentation will cover why we chose CTA, how CTA is deployed, and how it is integrated into our backup workflow.
EOS is now the main storage system for IHEP experiments like LHAASO and JUNO. Castor has long been used at IHEP to back up experiment data, but it has difficulty satisfying the data backup requirements of new experiments like LHAASO and JUNO. As EOSCTA became stable enough to replace Castor in production, we started the EOSCTA evaluation and the Castor migration. In this talk, we will give a brief...
An EOSCTA instance is an EOS instance, commonly called a tape buffer, configured with a CERN Tape Archive (CTA) back-end.
This EOS instance is entirely bandwidth-oriented: it offers an SSD-based tape interconnection, it can contain spinning disks if needed, and it is optimized for the various tape workflows.
This talk will present how to enable EOS for tape using CTA and the Swiss horology...
CTA uses the access mechanisms provided by EOS and adds a tape-specific layer. If one of these elements is misconfigured, a user won't be able to read a file or, on the contrary, unauthorized access may be granted.
This talk explains how the combination of ACLs, Unix permissions and mount rules works in CTA. We show which tools we use for permissions management and what the capabilities...
Explanation of the CTA Tape Drive status during a data transfer session.
This talk summarizes the new file-restoring feature of CTA: how it works, how to configure it, when it should be used, and its current limitations.
This presentation summarizes the current effort to detect, and thereby subsequently remedy, inconsistencies in the file metadata stored in EOS and CTA.
We show how we combine and validate EOSCTA namespaces in order to produce a summary of healthy files for experiments and a troubleshooting tool for operators.
Fermilab is the primary research lab dedicated to particle physics in the United States and is home to the largest archival HEP data store outside of CERN. Fermilab currently employs an HSM based on Enstore, a Fermilab product, for tape, and dCache for disk. This Enstore+dCache HSM manages nearly 300 PB of active data on tape. Because of the necessary development work to...
This talk will present details of the deployment of Antares, the EOS-CTA service at RAL Tier-1, which replaces Castor.
The ever-increasing amount of data produced by modern scientific facilities like EuXFEL or the LHC puts high pressure on the data management infrastructure at the laboratories. This includes poorly shareable archival storage resources, typically tape libraries. To achieve maximal efficiency of the available tape resources, a deep integration between hardware and software components...
Report on the latest tests done at SLAC with the native XRootD EC library.
This presentation will introduce the roadmap for EOS5 during the Run-3 period.