
BESIII Physics Data Storing and Processing on HBase and MapReduce

16 Apr 2015, 12:00
15m
Auditorium (Auditorium)

oral presentation Track2: Offline software Track 2 Session

Speaker

Ms Xiaofeng LEI (INSTITUTE OF HIGH ENERGY PHYSICS, University of Chinese Academy of Sciences)

Description

In recent years we have successfully applied Hadoop to high-energy physics analysis. Although this has improved the efficiency of data analysis and reduced the cost of building clusters, there is still room for optimization: pre-selection is static, random data reads are inefficient, and FUSE, which is used to access HDFS, creates an I/O bottleneck. To address these problems, this paper presents a new platform for storing and analyzing high-energy physics data. The data are moved from tree-structured DST files into HBase, a choice guided by the features of the data and of the analysis process: HBase handles random reads better than DST files and allows HDFS to be accessed directly. Several optimizations are applied to improve performance, and a customized protocol is defined for serializing and deserializing data in order to reduce the storage space used in HBase. To make full use of data locality in HBase, the paper proposes a new MapReduce model and a new split policy for HBase regions. In addition, we establish a dynamic, pluggable, easy-to-use pre-selection subsystem based on tags (event metadata). With properly set conditions, it can help physicists filter out as much as 999‰ (99.9%) of uninteresting events. This saves a large amount of I/O, improves CPU utilization, and shortens analysis time. Finally, several use cases are designed to evaluate the platform. The test results show that it is 3.5 times faster with pre-selection and 20% faster without pre-selection, and that it is stable and scalable.
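The tag-based pre-selection and compact serialization described above can be illustrated with a minimal sketch. It is not the authors' protocol: the tag fields (`run`, `event`, `n_charged`, `total_energy`), the byte layout, and the helper names are all hypothetical, and an in-memory list stands in for an HBase table. The point is only that a small fixed-size tag per event can be checked without ever deserializing the full event payload.

```python
import struct
from typing import Callable, Iterable, Iterator, NamedTuple, Tuple

# Hypothetical tag layout (an assumption, not taken from the paper):
# run number, event number, charged-track count, total energy in GeV.
TAG_FORMAT = "<IIHf"  # little-endian uint32, uint32, uint16, float32
TAG_SIZE = struct.calcsize(TAG_FORMAT)  # 14 bytes, no padding


class EventTag(NamedTuple):
    run: int
    event: int
    n_charged: int
    total_energy: float


def pack_tag(tag: EventTag) -> bytes:
    """Serialize a tag into a fixed-size binary record."""
    return struct.pack(TAG_FORMAT, *tag)


def unpack_tag(raw: bytes) -> EventTag:
    """Deserialize a fixed-size binary record back into a tag."""
    return EventTag(*struct.unpack(TAG_FORMAT, raw))


def row_key(run: int, event: int) -> bytes:
    """Big-endian key, so byte-wise ordering matches (run, event) ordering."""
    return struct.pack(">II", run, event)


def preselect(
    rows: Iterable[Tuple[bytes, bytes]],
    predicate: Callable[[EventTag], bool],
) -> Iterator[bytes]:
    """Yield row keys whose tag passes the predicate.

    Only the small tag is decoded; the full event payload is never read.
    """
    for key, raw_tag in rows:
        if predicate(unpack_tag(raw_tag)):
            yield key


if __name__ == "__main__":
    # Stand-in for a table scan: (row key, packed tag) pairs.
    rows = [
        (row_key(1, i), pack_tag(EventTag(1, i, i % 1000, 3.0)))
        for i in range(2000)
    ]
    selected = list(preselect(rows, lambda t: t.n_charged == 4))
    print(f"{len(selected)} of {len(rows)} events selected")
```

In this toy data set the predicate keeps one event in a thousand, which mirrors the 999‰ rejection figure quoted in the abstract: only the selected row keys would then need their full event data fetched.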

Primary author

Ms Xiaofeng LEI (INSTITUTE OF HIGH ENERGY PHYSICS, University of Chinese Academy of Sciences)

Co-authors

Dr Gongxing SUN (INSTITUTE OF HIGH ENERGY PHYSICS)
Mr Qiang LI (INSTITUTE OF HIGH ENERGY PHYSICS)
