4–8 Nov 2019
Adelaide Convention Centre
Australia/Adelaide timezone

CDFS: A high-efficiency Data Access System for Storage Federations

7 Nov 2019, 11:30
15m
Riverbank R8 (Adelaide Convention Centre)

Oral, Track 4 – Data Organisation, Management and Access

Speakers

Shiyuan Fu (Institute of High Energy Physics, Chinese Academy of Sciences)

Description

High energy physics (HEP) experiments produce large amounts of data, which are usually stored and processed at distributed sites. Distributed data management systems currently face challenges such as providing a global file namespace and efficient data access. To address these problems, this paper proposes the cross-domain data access file system (CDFS), a data cache and access system that spans distributed sites and is based on the edge computing model. CDFS uses flexible data caching and synchronization together with data deduplication and compression, with the aim of dynamically building an aggregate view of multiple distributed storage systems and accessing data quickly and efficiently.
The CDFS system consists of a metadata server, a cache server, a storage-optimized engine, and a data access interface. The metadata server locally builds a very fast dynamic namespace from multiple sites that expose protocols such as XRootD, HTTP and S3, hiding the real file location. The cache server caches and synchronizes file content and metadata on demand, speeding up data access and directory organization. The storage-optimized engine provides deduplication and compression. Deduplication ensures that only data blocks not already present at a site are transferred to it, eliminating redundant storage of identical data blocks at that site; compression stores data blocks in compressed form, minimizing the space each block requires. The data access interface provides a command-line tool and a FUSE client so that users can access data conveniently, hiding the complexity of the transfer process.
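The deduplication and compression steps described above can be illustrated with a minimal sketch. It is not the CDFS implementation: the fixed 4096-byte block size, the SHA-256 block fingerprint, the zlib codec, and the names `BlockStore`, `put_file` and `get_file` are all assumptions made for illustration, since the abstract does not specify any of them.

```python
import hashlib
import zlib
from typing import Dict, List

BLOCK_SIZE = 4096  # assumed block size; the abstract does not state one


class BlockStore:
    """Hypothetical per-site block store with deduplication and compression."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}  # fingerprint -> compressed block
        self.transferred = 0  # uncompressed bytes actually sent to the site

    def put_file(self, data: bytes) -> List[str]:
        """Store a file and return its recipe (ordered block fingerprints)."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                # Only blocks the site does not hold yet are transferred,
                # and they are kept in compressed form.
                self.blocks[digest] = zlib.compress(block)
                self.transferred += len(block)
            recipe.append(digest)
        return recipe

    def get_file(self, recipe: List[str]) -> bytes:
        """Rebuild a file by decompressing its blocks in recipe order."""
        return b"".join(zlib.decompress(self.blocks[d]) for d in recipe)

    def stored_bytes(self) -> int:
        """Total size of the compressed blocks held at the site."""
        return sum(len(b) for b in self.blocks.values())


if __name__ == "__main__":
    store = BlockStore()
    payload = b"A" * (3 * BLOCK_SIZE) + b"B" * BLOCK_SIZE
    recipe = store.put_file(payload)
    print(len(recipe), "blocks referenced,", len(store.blocks), "stored")
    print("round trip ok:", store.get_file(recipe) == payload)
```

In this sketch a file is represented by its list of block fingerprints, so writing the same content a second time adds no new blocks to the store and transfers no additional data.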
Tests on raw data from the LHAASO experiment showed that CDFS can present a single unified repository over data distributed across the sites in Chengdu, Daocheng and Beijing. In addition, the caching mechanism improves data access performance by more than a factor of 10, while the storage-optimized engine reduces the storage consumption of the raw data by about 50%.

Consider for promotion No

Primary authors

Shiyuan Fu (Institute of High Energy Physics, Chinese Academy of Sciences)
Qi Xu (Institute of High Energy Physics, Chinese Academy of Sciences)
Yaodong Cheng (IHEP)
Gang CHEN (INSTITUTE OF HIGH ENERGY PHYSICS)
