Jul 9 – 13, 2018
Sofia, Bulgaria
Europe/Sofia timezone

Columnar data processing for HEP analysis

Jul 10, 2018, 11:00 AM
Hall 9 (National Palace of Culture)

Presentation, Track 6 – Machine learning and physics analysis


Jim Pivarski (Princeton University)


In the last stages of data analysis, only order-of-magnitude computing speedups translate into increased human productivity, and only if they're not difficult to set up. Producing a plot in a second instead of an hour is life-changing, but not if it takes two hours to write the analysis code. Fortunately, HPC-inspired techniques can result in such large speedups, but unfortunately, they can be difficult to use in a HEP setting.

These techniques generally favor operating on columns (arrays representing a single attribute across events) rather than on whole events individually. This lets data stream predictably from disk to main memory and finally to CPU/GPU/KNL onboard memory (e.g. L* cache), which helps prefetching and sometimes allows for vectorization. However, HEP data contain variable-length structures, such as different numbers of particles per event, and this makes columnar techniques difficult to apply to HEP problems.
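
One common way to hold variable-length data in columns is a flat content array plus an offsets array that marks event boundaries. The following is a minimal NumPy sketch of that idea, not the layout used by any particular tool in this talk; the muon values and the names `muon_pt`, `offsets`, `counts_per_event`, and `sum_per_event` are hypothetical.

```python
import numpy as np

# Hypothetical toy data: 4 events holding 2, 0, 3 and 1 muons.
# muon_pt stores one attribute (pT) for every muon, flattened across events.
# offsets[i]:offsets[i + 1] is the slice of muon_pt that belongs to event i.
muon_pt = np.array([31.5, 12.0, 45.0, 8.5, 22.0, 60.0])
offsets = np.array([0, 2, 2, 5, 6])


def counts_per_event(offsets):
    """Number of particles in each event."""
    return np.diff(offsets)


def sum_per_event(content, offsets):
    """Per-event sum of one attribute, computed without a Python loop."""
    # A running total lets each event's sum be read off as a difference
    # of two entries, which also gives 0 for empty events.
    running = np.concatenate(([0.0], np.cumsum(content)))
    return running[offsets[1:]] - running[offsets[:-1]]


n_muons = counts_per_event(offsets)
sum_pt = sum_per_event(muon_pt, offsets)
print(n_muons)
print(sum_pt)
```

Both functions touch only contiguous arrays, so the per-event quantities come from a few array-at-a-time passes rather than from a loop over event objects.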

We will describe several new software tools to make it easier to compute analysis functions with columnar arrays in HEP: array-at-a-time I/O in ROOT ("BulkIO") and Python/NumPy ("uproot"), compiling object-oriented analysis code into columnar operations ("oamap" for "object-array mapping"), and storage solutions with columnar granularity. We will show performance plots and usage examples.
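
The object-array mapping idea can be illustrated with thin proxy objects whose attributes are read from columns on demand. This is only a sketch under assumed names (`ParticleView`, `EventView`, `leading_pt`, and the toy columns are all hypothetical); it does not use the oamap API and does not show the compilation step the talk refers to.

```python
import numpy as np


class ParticleView:
    """Hypothetical proxy: behaves like a particle object, stores no data."""

    def __init__(self, columns, index):
        self._columns = columns
        self._index = index

    def __getattr__(self, name):
        # Only reached for names not found normally, e.g. "pt" or "eta".
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._columns[name][self._index]
        except KeyError:
            raise AttributeError(name)


class EventView:
    """Hypothetical proxy for one event, defined by a slice of the columns."""

    def __init__(self, columns, offsets, event):
        self._columns = columns
        self._start = int(offsets[event])
        self._stop = int(offsets[event + 1])

    @property
    def muons(self):
        return [ParticleView(self._columns, i)
                for i in range(self._start, self._stop)]


# Hypothetical toy columns: 4 events holding 2, 0, 3 and 1 muons.
columns = {
    "pt": np.array([31.5, 12.0, 45.0, 8.5, 22.0, 60.0]),
    "eta": np.array([0.1, -1.2, 2.0, 0.4, -0.3, 1.1]),
}
offsets = np.array([0, 2, 2, 5, 6])


def leading_pt(event):
    """Analysis code written against objects, not arrays."""
    return max((float(m.pt) for m in event.muons), default=0.0)


results = [leading_pt(EventView(columns, offsets, i))
           for i in range(len(offsets) - 1)]
print(results)
```

The analysis function sees events and particles, while the data stay in flat arrays. A compiled approach would aim to remove the per-object overhead that this pure-Python version still pays.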

Primary authors

Jim Pivarski (Princeton University), Peter Elmer (Princeton University (US)), Jaydeep Nandi, David Lange (Princeton University (US))
