November 29, 2021 to December 3, 2021
Virtual and IBS Science Culture Center, Daejeon, South Korea
Asia/Seoul timezone

Evaluating Query Languages and Systems for High-Energy Physics Data

contribution ID 679
Nov 29, 2021, 7:00 PM
20m
S221-A (Virtual and IBS Science Culture Center)

55 EXPO-ro, Yuseong-gu, Daejeon, South Korea; email: library@ibs.re.kr; phone: +82 42 878 8299
Oral Track 1: Computing Technology for Physics Research

Speaker

Ingo Müller (ETH Zurich)

Description

In the domain of high-energy physics (HEP), query languages in general and SQL in particular have found limited acceptance. This is surprising since HEP data analysis matches the SQL model well: the data is fully structured and queried using mostly standard operators. To understand why this is the case, we perform a comprehensive analysis of six diverse, general-purpose data processing platforms and compare them with ROOT's RDataFrame interface, using the Analysis Description Languages (ADL) benchmark as the workload. We identify 16 language features that are useful in implementing typical query patterns found in HEP analyses, categorize them by how essential they are, and analyze how well the different query interfaces implement them. The evaluation yields an interesting and rather complex picture of existing solutions: their query languages vary greatly in how naturally and concisely HEP query patterns can be expressed, but the best-suited languages arguably allow for more elegant query formulations than RDataFrame. At the same time, most of these platforms are also between one and two orders of magnitude slower than RDataFrame when tested on large data sets. These observations suggest that, while database systems and their query languages are in principle viable tools for HEP, significant performance improvements are necessary to make them relevant in practice.
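To illustrate the claim that HEP analysis fits the SQL model, the following is a minimal sketch of one query in the style of the benchmark: select the missing transverse energy of events that contain at least two jets above a transverse-momentum threshold. The flat two-table schema (`events`, `jets`), the column names, the toy values, and the helper functions are all assumptions made for this illustration; they are not taken from the study, and real HEP data is typically stored in nested form. SQLite is used only because it ships with Python; it is not claimed to be one of the evaluated systems.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical flat schema and toy data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (event_id INTEGER PRIMARY KEY, met REAL)")
    conn.execute("CREATE TABLE jets (event_id INTEGER, pt REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(1, 12.5), (2, 30.0), (3, 7.25)],
    )
    conn.executemany(
        "INSERT INTO jets VALUES (?, ?)",
        [
            (1, 55.0), (1, 42.0), (1, 10.0),  # event 1: two jets above 40
            (2, 80.0), (2, 20.0),             # event 2: one jet above 40
            (3, 41.0), (3, 45.0), (3, 60.0),  # event 3: three jets above 40
        ],
    )
    return conn


def met_of_events_with_two_hard_jets(conn, pt_min=40.0):
    """Return the MET of every event with at least two jets whose pt exceeds pt_min."""
    query = """
        SELECT e.met
        FROM events AS e
        JOIN jets AS j ON j.event_id = e.event_id
        WHERE j.pt > ?
        GROUP BY e.event_id, e.met
        HAVING COUNT(*) >= 2
        ORDER BY e.event_id
    """
    return [row[0] for row in conn.execute(query, (pt_min,))]


if __name__ == "__main__":
    print(met_of_events_with_two_hard_jets(build_toy_db()))
```

The filter, join, grouping, and aggregate condition are all standard relational operators, which is the sense in which such analyses match the SQL model. With nested per-event jet collections, the same query would instead rely on the array and nesting features of the query language, which is where the evaluated interfaces can be expected to differ.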

Significance

The talk presents the outcome of a collaboration of particle physicists at the University of Washington and database systems researchers at ETH Zurich. The results of the study are completely novel and have not been submitted or published before (except for the reference below).

References

In revision for the Proceedings of the VLDB Endowment Vol. 15 (https://vldb.org/pvldb/vol15-volume-info/). Will be presented at the 48th International Conference on Very Large Data Bases 2022 (VLDB, http://vldb.org/2022/) if accepted. Preprint available on arXiv: https://arxiv.org/abs/2104.12615.

Speaker time zone: Compatible with Europe

Authors

Mr Dan Graur (ETH Zurich)
Ingo Müller (ETH Zurich)
Mason Proffitt (University of Washington (US))
Dr Ghislain Fourny (ETH Zurich)
Gordon Watts (University of Washington (US))
Prof. Gustavo Alonso (ETH Zurich)
