July 31, 2018 to August 6, 2018
Maynooth University
Europe/Dublin timezone

Big Data Software in High Energy Physics

Aug 2, 2018, 5:50 PM
Hall C (Arts Bldg.)


Talk H. Statistical Methods for Physics Analysis in the XXI Century


Jim Pivarski (Princeton University)


For decades, high-energy physics (HEP) was at the forefront of big data technology, developing techniques to explore and analyze datasets too large for memory. These techniques were seen as revolutionary when they appeared in other fields years later. Today, that dominance is ending, and I argue that this is a good thing. The rise of web-scale datasets and high-frequency trading has interested the commercial sector in data analysis, driving the development of professional yet open-source software with a much larger user base than HEP's: software that we do not need to develop or maintain ourselves.

However, using this software in HEP analysis isn't trivial, at least not yet. Some differences in conventions have to be bridged, such as the gap between HEP's C++ toolset and the preponderance of Python, R, and Java/Scala in industry. I will show some of the "plumbing" software that bridges this gap for Python (PyROOT and uproot) and Java/Scala (Spark-ROOT). But there are also deeper differences in emphasis between the two communities: our nested data model vs. the industry's flat data frames, our focus on histograms and basic plotting, and the industry's satisfaction with merely predictive models. After showing illustrative examples of these tools and how to use them, I will conclude that we still have work to do, developing some software on our own, but that we can benefit significantly by working within the conventions of the larger big data community.

Primary author

Jim Pivarski (Princeton University)
