Conveners
T4 - Data handling: S1
- Tigran Mkrtchyan (A.Alikhanyan National Science Laboratory (AM))
- Tigran Mkrtchyan (DESY)
T4 - Data handling: S2
- Tigran Mkrtchyan (DESY)
- Tigran Mkrtchyan (A.Alikhanyan National Science Laboratory (AM))
T4 - Data handling: S3
- Costin Grigoras (CERN)
T4 - Data handling: S4
- Costin Grigoras (CERN)
T4 - Data handling: S5
- Costin Grigoras (CERN)
T4 - Data handling: S6
- Elizabeth Gallas (University of Oxford (GB))
T4 - Data handling: S7
- Elizabeth Gallas (University of Oxford (GB))
Patrick Meade (University of Wisconsin-Madison) | 09/07/2018, 11:00 | Track 4 - Data Handling | presentation
IceCube is a cubic kilometer neutrino detector located at the South Pole. Every year, 29 TB of data are transmitted via satellite, and 365 TB of data are shipped on archival media, to the data warehouse in Madison, WI, USA. The JADE Long Term Archive (JADE-LTA) software indexes and bundles IceCube files and transfers the archive bundles for long term storage and preservation into tape silos...
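The indexing-and-bundling step described above can be pictured as grouping files into size-capped archive bundles while keeping an index for later retrieval. The sketch below is purely illustrative, not the actual JADE-LTA code; the bundle size cap and the choice of SHA-512 checksums are assumptions.

```python
import hashlib
from pathlib import Path

# Illustrative cap per archive bundle; the real JADE-LTA policy is not
# given in the abstract.
BUNDLE_CAP = 500 * 10**9

def make_bundles(files, cap=BUNDLE_CAP):
    """Group files into bundles of at most `cap` bytes and return an
    index: bundle id -> list of (path, size, sha512) for retrieval."""
    index, current, current_size, bundle_id = {}, [], 0, 0
    for path in map(Path, files):
        size = path.stat().st_size
        if current and current_size + size > cap:
            index[bundle_id] = current          # close the full bundle
            bundle_id, current, current_size = bundle_id + 1, [], 0
        digest = hashlib.sha512(path.read_bytes()).hexdigest()
        current.append((str(path), size, digest))
        current_size += size
    if current:
        index[bundle_id] = current              # close the last bundle
    return index
```

The index is what makes the archive usable: a file can later be located by looking up which bundle holds it, without scanning the tape contents.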
Michael Davis (CERN) | 09/07/2018, 11:15 | Track 4 - Data Handling | presentation
The first production version of the CERN Tape Archive (CTA) software is planned for release by the end of 2018. CTA is designed to replace CASTOR as the CERN tape archive solution, in order to face the scalability and performance challenges arriving with LHC Run-3.
This contribution will describe the main commonalities and differences of CTA with CASTOR. We outline the functional enhancements...
Tigran Mkrtchyan (DESY) | 09/07/2018, 11:30 | Track 4 - Data Handling | presentation
The dCache project provides open-source storage software deployed internationally to satisfy ever more demanding scientific storage requirements. Its multifaceted approach provides an integrated way of supporting different use-cases with the same storage, from high throughput data ingest, through wide access and easy integration with existing systems.
In supporting new communities, such as...
Dr Doris Ressmann (KIT) | 09/07/2018, 11:45 | Track 4 - Data Handling | presentation
Tape storage is still a cost-effective way to keep large amounts of data over a long period of time. It is expected that this will continue in the future. The GridKa tape environment is a complex system of many hardware components and software layers. Configuring this system for optimal performance for all use cases is a non-trivial task and requires a lot of experience. We present the current...
Valentin Y Kuznetsov (Cornell University (US)) | 09/07/2018, 12:00 | Track 4 - Data Handling | presentation
The CMS experiment at the CERN LHC developed the Workflow Management Archive system to persistently store unstructured framework job report documents produced by distributed workflow management agents. In this talk we present its architecture, implementation, deployment, and integration with the CMS and CERN computing infrastructures, such as the central HDFS and Hadoop Spark cluster. The system...
Julia Andreeva (CERN) | 09/07/2018, 12:15 | Track 4 - Data Handling | presentation
The WLCG computing infrastructure provides distributed storage capacity hosted at the geographically dispersed computing sites.
In order to effectively organize storage and processing of the LHC data, the LHC experiments require a reliable and complete overview of the storage capacity in terms of the occupied and free space, the storage shares allocated to different computing activities, and...
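The overview described above amounts to aggregating per-site storage reports into occupied/free totals per computing activity. A minimal sketch of that aggregation, assuming a simple report layout (site dictionaries with per-activity shares) that is invented for illustration:

```python
def summarise_capacity(site_reports):
    """Aggregate per-site storage reports into occupied/free totals
    per computing activity, giving the combined overview the
    experiments need."""
    totals = {}
    for report in site_reports:
        for share in report["shares"]:
            t = totals.setdefault(share["activity"],
                                  {"occupied": 0, "free": 0})
            t["occupied"] += share["occupied"]  # sum over all sites
            t["free"] += share["free"]
    return totals
```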
Samuel Cadellin Skipsey | 09/07/2018, 14:00 | Track 4 - Data Handling | presentation
Pressures from both WLCG VOs and externalities have led to a desire to "simplify" data access and handling for Tier-2 resources across the Grid. This has mostly been imagined in terms of reducing book-keeping for VOs, and the total replicas needed across sites. One common direction of motion is to increase the amount of remote access to data for jobs, which is also seen as enabling the...
Dr Teng LI (University of Edinburgh) | 09/07/2018, 14:15 | Track 4 - Data Handling | presentation
The XCache (XRootD Proxy Cache) provides a disk-based caching proxy for data access via the XRootD protocol. This can be deployed at WLCG Tier-2 computing sites to provide a transparent cache service for the optimisation of data access, placement and replication.
We will describe the steps to enable full read/write operations to storage endpoints consistent with the distributed data...
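The transparent caching behaviour described above — serve a file locally when it is already cached, otherwise fetch it from the origin and keep a copy — can be sketched generically. This is not XCache or XRootD code; the class and its interface are invented to illustrate the read-through pattern.

```python
from pathlib import Path

class ReadThroughCache:
    """Minimal read-through cache: serve a file from local disk when
    present, otherwise fetch it from the remote origin and keep a copy
    so later reads stay local."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = Path(cache_dir)
        self.fetch = fetch            # callable: name -> bytes (remote read)
        self.hits = self.misses = 0

    def read(self, name):
        local = self.cache_dir / name
        if local.exists():            # cache hit: no remote traffic
            self.hits += 1
            return local.read_bytes()
        self.misses += 1
        data = self.fetch(name)       # cache miss: go to the origin
        local.parent.mkdir(parents=True, exist_ok=True)
        local.write_bytes(data)       # transparent caching for next time
        return data
```

The point of the design is that clients see one interface regardless of where the bytes actually come from, which is what makes the cache "transparent" to jobs.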
Christoph Heidecker (KIT - Karlsruhe Institute of Technology (DE)) | 09/07/2018, 14:30 | Track 4 - Data Handling | presentation
High throughput and short turnaround cycles are core requirements for the efficient processing of I/O-intense end-user analyses. Together with the tremendously increasing amount of data to be processed, this leads to enormous challenges for HEP storage systems, networks and the data distribution to end-users. This situation is further compounded by taking into account opportunistic resources...
Zbigniew Baranowski (CERN) | 09/07/2018, 14:45 | Track 4 - Data Handling | presentation
The interest in using Big Data solutions based on the Hadoop ecosystem is constantly growing in the HEP community. This drives the need for increased reliability and availability of the central Hadoop service and underlying infrastructure provided to the community by the CERN IT department.
This contribution will report on the overall status of the Hadoop platform and the recent enhancements and...
Dirk Duellmann (CERN) | 09/07/2018, 15:00 | Track 4 - Data Handling | presentation
The EOS deployment at CERN is a core service used both for scientific data processing and analysis and as the back-end for general end-user storage (e.g. home directories/CERNBox). The disk failure metrics collected over a period of one year from a deployment of some 70k disks allow a first systematic analysis of the behaviour of different hard disk types for the large CERN use-cases. In this...
Elizabeth Gallas (University of Oxford (GB)) | 09/07/2018, 15:15 | Track 4 - Data Handling | presentation
Processing ATLAS event data requires a wide variety of auxiliary information from geometry, trigger, and conditions database systems. This information is used to dictate the course of processing and refine the measurement of particle trajectories and energies to construct a complete and accurate picture of the remnants of particle collisions. Such processing occurs on a worldwide computing...
Holger Schulz (Fermilab) | 09/07/2018, 15:30 | Track 4 - Data Handling | presentation
In their measurement of the neutrino oscillation parameters (PRL 118, 231801 (2017)), NOvA uses a sample of approximately 27 million reconstructed spills to search for electron-neutrino appearance events. These events are stored in an n-tuple format, in 180 thousand ROOT files. File sizes range from a few hundred KiB to a few MiB; the full dataset is approximately 3 TiB. These millions of...
Mario Lassnig (CERN) | 09/07/2018, 15:45 | Track 4 - Data Handling | presentation
With the LHC High Luminosity upgrade the workload and data management systems are facing new major challenges. To address those challenges ATLAS and Google agreed to cooperate on a project to connect Google Cloud Storage and Compute Engine to the ATLAS computing environment. The idea is to allow ATLAS to explore the use of different computing models, to allow ATLAS user analysis to benefit...
Lorenzo Rinaldi (Universita e INFN, Bologna (IT)) | 10/07/2018, 11:00 | Track 4 - Data Handling | presentation
The ATLAS experiment is approaching mid-life: the long shutdown period (LS2) between LHC Run 2 (ending in 2018) and the future collision data-taking of Runs 3 and 4 (starting in 2021). In advance of LS2, we have been assessing the future viability of existing computing infrastructure systems. This will permit changes to be implemented in time for Run 3. In systems with broad impact...
Lynn Wood (Pacific Northwest National Laboratory, USA) | 10/07/2018, 11:15 | Track 4 - Data Handling | presentation
The Belle II experiment at KEK is preparing for first collisions in early 2018. Processing the large amounts of data that will be produced requires conditions data to be readily available to systems worldwide in a fast and efficient manner that is straightforward for both the user and maintainer. This was accomplished by relying on industry-standard tools and methods: the conditions database...
Dave Dykstra (Fermi National Accelerator Lab. (US)) | 10/07/2018, 11:30 | Track 4 - Data Handling | presentation
LHC experiments make extensive use of Web proxy caches, especially for software distribution via the CernVM File System and for conditions data via the Frontier Distributed Database Caching system. Since many jobs read the same data, cache hit rates are high and hence most of the traffic flows efficiently over Local Area Networks. However, it is not always possible to have local Web caches,...
A new mechanism to use the Conditions Database REST API to serve the ATLAS detector description | Alessandro De Salvo (Sapienza Universita e INFN, Roma I (IT)) | 10/07/2018, 11:45 | Track 4 - Data Handling | presentation
An efficient and fast access to the detector description of the ATLAS experiment is needed for many tasks, at different steps of the data chain: from detector development to reconstruction, from simulation to data visualization. Until now, the detector description was only accessible through dedicated services integrated into the experiment's software framework, or by the usage of external...
Marco Clemencic (CERN) | 10/07/2018, 12:00 | Track 4 - Data Handling | presentation
LHCb has been using the CERN/IT developed Conditions Database library COOL for several years, during LHC Run 1 and Run 2. With the opportunity window of the second long shutdown of LHC, in preparation for Run 3 and the upgraded LHCb detector, we decided to investigate alternatives to COOL as Conditions Database backend. In particular, given our conditions and detector description data model,...
Yaodong Cheng (Chinese Academy of Sciences (CN)) | 10/07/2018, 12:15 | Track 4 - Data Handling | presentation
The Beijing Spectrometer (BESIII) experiment has produced hundreds of billions of events. It has collected the world's largest data samples of J/ψ, ψ(3686), ψ(3770) and ψ(4040) decays. The typical branching fractions for interesting physics channels are of the order of O(10^-3). The traditional event-wise accessing of BOSS (Bes Offline Software System) is not effective for the selective accessing...
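With branching fractions of O(10^-3), reading every event to find a handful of candidates wastes most of the I/O; an event-tag index lets a job touch only the matching events. A schematic illustration of the idea (the function names and event layout are invented, not the BOSS API):

```python
def build_tag_index(events, tag_of):
    """Map each tag value to the positions of the events carrying it,
    so a later selection never scans the full sample."""
    index = {}
    for pos, event in enumerate(events):
        index.setdefault(tag_of(event), []).append(pos)
    return index

def select_events(events, index, tag):
    """Selective access: read only the events whose tag matches."""
    return [events[pos] for pos in index.get(tag, [])]
```

For a channel with a 10^-3 branching fraction, the selection reads roughly a thousandth of the events an exhaustive scan would touch.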
Rob Appleyard (STFC) | 10/07/2018, 14:00 | Track 4 - Data Handling | presentation
Since February 2017, the RAL Tier-1 has been storing production data from the LHC experiments on its new Ceph-backed object store called Echo. Echo has been designed to meet the data demands of LHC Run 3 and should scale to meet the challenges of HL-LHC. Echo is already providing better overall throughput than the service it will replace (CASTOR) even with significantly less hardware...
Mr Tigran Mkrtchyan (DESY) | 10/07/2018, 14:15 | Track 4 - Data Handling | presentation
The life cycle of scientific data is well defined: data is collected, then processed, archived and finally deleted. Data is never modified. The original data is used or new, derived data is produced: Write Once Read Many times (WORM). With this model in mind, dCache was designed to handle immutable files as efficiently as possible. Currently, data replication, HSM connectivity and...
Fabrizio Furano (CERN) | 10/07/2018, 14:30 | Track 4 - Data Handling | presentation
The DPM (Disk Pool Manager) system is a multiprotocol scalable technology for Grid storage that supports about 130 sites for a total of about 90 Petabytes online.
The system has recently completed the development phase that had been announced in the past years, which consolidates its core component (DOME: Disk Operations Management Engine) as a full-featured high performance engine that can...
Herve Rousseau (CERN) | 10/07/2018, 14:45 | Track 4 - Data Handling | presentation
The CERN IT Storage group operates multiple distributed storage systems and is responsible for supporting the infrastructure that accommodates all CERN storage requirements, from the physics data generated by LHC and non-LHC experiments to personal user files. EOS is now the key component of the CERN storage strategy. It allows operation at high incoming throughput for experiment...
Andrea Manzi (CERN) | 10/07/2018, 15:00 | Track 4 - Data Handling | presentation
The EOS namespace has outgrown its legacy in-memory implementation, presenting the need for an alternative solution. In response to this need we developed QuarkDB, a highly-available datastore capable of serving as the metadata backend for EOS. Even though the datastore was tailored to the needs of the namespace, its capabilities are generic.
We will present the overall system design, and our...
Hugo Gonzalez Labrador (CERN) | 10/07/2018, 15:15 | Track 4 - Data Handling | presentation
CERNBox is the CERN cloud storage hub. It allows synchronising and sharing files on all major desktop and mobile platforms (Linux, Windows, MacOSX, Android, iOS), aiming to provide universal access and offline availability to any data stored in the CERN EOS infrastructure.
With more than 12000 users registered in the system, CERNBox has responded to the high demand in our diverse community to...
Hugo Gonzalez Labrador (CERN) | 10/07/2018, 15:30 | Track 4 - Data Handling | presentation
In the last few years we have seen constant interest in technologies providing effective cloud storage for scientific use, matching the requirements of price, privacy and scientific usability. This interest is not limited to HEP and extends to other scientific fields due to the rapid growth of data: for example, "big data" is a characteristic of modern genomics, energy and financial...
Herve Rousseau (CERN) | 10/07/2018, 15:45 | Track 4 - Data Handling | presentation
The Ceph File System (CephFS) is a software-defined network filesystem built upon the RADOS object store. In the Jewel and Luminous releases, CephFS was labeled as production ready with horizontally scalable metadata performance. This paper seeks to evaluate that statement in relation to both the HPC and general IT infrastructure needs at CERN. We highlight the key metrics required by four...
Nicolo Magini (INFN e Universita Genova (IT)) | 11/07/2018, 11:30 | Track 4 - Data Handling | presentation
The ATLAS experiment is gradually transitioning from the traditional file-based processing model to dynamic workflow management at the event level with the ATLAS Event Service (AES). The AES assigns fine-grained processing jobs to workers and streams out the data in quasi-real time, ensuring fully efficient utilization of all resources, including the most volatile. The next major step in this...
Hasib Md (University of Delhi (IN)) | 11/07/2018, 11:45 | Track 4 - Data Handling | presentation
Alignment and calibration workflows in CMS require a significant operational effort, due to the complexity of the systems involved. To serve the variety of condition data management needs of the experiment, the alignment and calibration team has developed and deployed a set of web-based applications. The Condition DB Browser is the main portal to search, navigate and prepare a consistent set...
Thomas Maier (Ludwig Maximilians Universitat (DE)) | 11/07/2018, 12:00 | Track 4 - Data Handling | presentation
For high-throughput computing the efficient use of distributed computing resources relies on an evenly distributed workload, which in turn requires wide availability of input data that is used in physics analysis. In ATLAS, the dynamic data placement agent C3PO was implemented in the ATLAS distributed data management system Rucio; it identifies popular data and creates additional, transient...
Alvaro Fernandez Casani (Univ. of Valencia and CSIC (ES)) | 11/07/2018, 12:15 | Track 4 - Data Handling | presentation
The ATLAS EventIndex currently runs in production in order to build a complete catalogue of events for experiments with large amounts of data. The current approach is to index all final produced data files at CERN Tier0, and at hundreds of grid sites, with a distributed data collection architecture using Object Stores to temporarily maintain the conveyed information, with references to them...
Brian Paul Bockelman (University of Nebraska Lincoln (US)) | 11/07/2018, 12:30 | Track 4 - Data Handling | presentation
GridFTP transfers and the corresponding Grid Security Infrastructure (GSI)-based authentication and authorization system have been data transfer pillars of the Worldwide LHC Computing Grid (WLCG) for more than a decade. However, in 2017, the end of support for the Globus Toolkit - the reference platform for these technologies - was announced. This has reinvigorated and expanded efforts to...
Brian Paul Bockelman (University of Nebraska Lincoln (US)) | 11/07/2018, 12:45 | Track 4 - Data Handling | presentation
Outside the HEP computing ecosystem, it is vanishingly rare to encounter user X509 certificate authentication (and proxy certificates are even more rare). The web never widely adopted the user certificate model, but increasingly sees the need for federated identity services and distributed authorization. For example, Dropbox, Google and Box instead use bearer tokens issued via the OAuth2...
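The bearer-token model mentioned above replaces client certificates with an HTTP `Authorization` header carried on each request. A minimal sketch using only the Python standard library; the URL and token value are placeholders:

```python
import urllib.request

def authorized_request(url, token):
    """Build an HTTP request carrying an OAuth2-style bearer token,
    the pattern used by services such as Dropbox, Google and Box."""
    req = urllib.request.Request(url)
    # The server validates the token instead of a client certificate.
    req.add_header("Authorization", f"Bearer {token}")
    return req
```

Unlike an X509 handshake, nothing about the client's identity lives in the TLS layer; the token alone conveys the authorization, which is what makes the scheme easy to federate.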
Martin Barisits (CERN) | 12/07/2018, 11:00 | Track 4 - Data Handling | presentation
Rucio, the distributed data management system of the ATLAS collaboration, already manages more than 330 Petabytes of physics data on the grid. Rucio has seen incremental improvements throughout LHC Run-2 and is currently being prepared for the HL-LHC era of the experiment. Next to these improvements the system is currently evolving into a full-scale generic data management system for...
Janusz Martyniak | 12/07/2018, 11:15 | Track 4 - Data Handling | presentation
The SoLid experiment is a short-baseline neutrino project located at the BR2 research reactor in Mol, Belgium. It started data taking in November 2017. Data management, including long term storage, will be handled in close collaboration by VUB Brussels, Imperial College London and Rutherford Appleton Laboratory (RAL).
The data management system makes the data available for analysis on the...
Simone Campana (CERN) | 12/07/2018, 11:30 | Track 4 - Data Handling | presentation
The computing strategy document for HL-LHC identifies storage as one of the main WLCG challenges in one decade from now. In the naive assumption of applying today's computing model, the ATLAS and CMS experiments will need one order of magnitude more storage resources than what could be realistically provided by the funding agencies at the same cost of today. The evolution of the computing...
Ms Qiumei Ma (IHEP) | 12/07/2018, 11:45 | Track 4 - Data Handling | presentation
The BESIII experiment has been taking data for more than ten years, and about fifty thousand runs have been recorded, so managing such a large dataset is a big challenge for us. Over the years, we have created an efficient and complete data management system, including a MySQL database, a C++ API, a bookkeeping system, monitoring applications, etc. This talk focuses on the BESIII central database management system's...
Dr Malachi Schram (Pacific Northwest National Laboratory) | 12/07/2018, 12:00 | Track 4 - Data Handling | presentation
The Belle II experiment at the SuperKEKB collider in Tsukuba, Japan, will start taking physics data in early 2018 and aims to accumulate 50/ab, or approximately 50 times more data than the Belle experiment. The collaboration expects it will manage and process approximately 200 PB of data.
Computing at this scale requires efficient and coordinated use of the compute grids in North America,...
Patrick Meade (University of Wisconsin-Madison) | 12/07/2018, 12:15 | Track 4 - Data Handling | presentation
IceCube is a cubic kilometer neutrino detector located at the South Pole. Metadata for files in IceCube has traditionally been handled on an application-by-application basis, with no user-facing access. There has been no unified view of data files, and users often just ls the filesystem to locate files. Recently, effort has been put into creating such a unified view. Going for a simple...
Alastair Dewhurst (STFC-Rutherford Appleton Laboratory (GB)) | 12/07/2018, 14:00 | Track 4 - Data Handling | presentation
CVMFS has proved an extremely effective mechanism for providing scalable, POSIX-like access to experiment software across the Grid. The normal method for file access is http downloads via squid caches from a small number of Stratum 1 servers. In the last couple of years this mechanism has been extended to allow access of files from any storage offering http access. This has been named...
Silvio Pardi (INFN) | 12/07/2018, 14:15 | Track 4 - Data Handling | presentation
The implementation of cache systems in the computing model of HEP experiments accelerates access to hot data sets by scientists, opening new scenarios of data distribution and enabling the exploitation of the paradigm of storage-less sites.
In this work, we present a study for the creation of an http data-federation eco-system with caching functionality. By exploiting the volatile-pool concept...
Jan Erik Sundermann (Karlsruhe Institute of Technology (KIT)) | 12/07/2018, 14:30 | Track 4 - Data Handling | presentation
The computing center GridKa is serving the ALICE, ATLAS, CMS and LHCb experiments as one of the biggest WLCG Tier-1 centers worldwide with compute and storage resources. It is operated by the Steinbuch Centre for Computing at Karlsruhe Institute of Technology in Germany. In April 2017 a new online storage system was put into operation. In its current stage of expansion it offers the HEP...
Daniele Cesini (Universita e INFN, Bologna (IT)) | 12/07/2018, 14:45 | Track 4 - Data Handling | presentation
The development of data management services capable of coping with very large data resources is a key challenge to allow future e-infrastructures to address the needs of the next generation of extreme-scale scientific experiments.
To face this challenge, in November 2017 the H2020 "eXtreme DataCloud - XDC" project was launched. Lasting for 27 months and combining the expertise of 8 large...
Dr Marcus Ebert (University of Victoria) | 12/07/2018, 15:00 | Track 4 - Data Handling | presentation
The dynamic data federation software (Dynafed), developed by CERN IT, provides a federated storage cluster on demand using the HTTP protocol with WebDAV extensions. Traditional storage sites which support an experiment can be added to Dynafed without requiring any changes to the site. Dynafed also supports direct access to cloud storage such as S3 and Azure. We report on the usage of Dynafed...
Paul Millar (DESY) | 12/07/2018, 15:15 | Track 4 - Data Handling | presentation
Whatever the use case, for federated storage to work well some knowledge from each storage system must exist outside that system. This is needed to allow coordinated activity; e.g., executing analysis jobs on worker nodes with good accessibility to the data.
Currently, this is achieved by clients notifying central services of activity; e.g., a client notifies a replica catalogue after an...
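The current model described above — a client notifying a central replica catalogue after a successful transfer — can be sketched as a toy catalogue. The class and method names are invented for illustration; real systems add authentication, persistence and failure handling.

```python
class ReplicaCatalogue:
    """Toy central catalogue mapping a logical file name (LFN) to the
    storage endpoints known to hold a replica."""

    def __init__(self):
        self.replicas = {}

    def notify(self, lfn, endpoint):
        # Client-side notification after a successful transfer: this is
        # the knowledge that must exist outside the storage system.
        self.replicas.setdefault(lfn, set()).add(endpoint)

    def locate(self, lfn):
        """Where can a job read this file from?"""
        return sorted(self.replicas.get(lfn, ()))
```

The weakness the abstract hints at is visible even in the sketch: if the client crashes between the transfer and the notify call, the catalogue and the storage system silently disagree.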