Description
The community's growing adoption of Hist and boost-histogram, both part of the Scikit-HEP software stack, makes work with dense, high-dimensional histograms increasingly common. These histograms become a memory bottleneck in modern large-scale high-energy physics (HEP) analyses: a dense histogram allocates one bin for every element of the Cartesian product of its axes, so its size is the product of the bin counts of all axes.
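As a rough illustration of this scaling, the size of a dense histogram can be estimated from the bin counts of its axes. The axis names, bin counts, and the 8-byte-per-bin storage below are assumptions chosen for the example, not numbers from a particular analysis, and underflow/overflow bins are ignored.

```python
import math

# Hypothetical axis layout: name -> number of bins (illustrative values only).
axes = {"dataset": 100, "category": 50, "systematic": 200, "observable": 100}

# Assumed storage: one 8-byte floating-point count per bin.
bytes_per_bin = 8

# A dense histogram holds one bin per element of the Cartesian product of its axes.
n_bins = math.prod(axes.values())
size_gb = n_bins * bytes_per_bin / 1e9

print(f"{n_bins:,} bins, about {size_gb:.1f} GB per copy")
```

Under these assumptions, each additional axis multiplies the footprint by its bin count, and a storage type that also tracks variances would double it again.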
To solve this problem, we propose “Histogramming as a Service” for large-scale HEP analyses. The core idea is to offload the filling of histograms from each worker in a distributed environment (e.g., batch systems) to a single dedicated server. This significantly reduces the overall memory requirements, as not every worker needs to maintain a copy of a histogram; instead, a central histogram is stored on the server.
Histogramming as a Service offers other advantages besides reducing memory usage. Filling the server-side histogram remotely can be a non-blocking operation, allowing the rest of the HEP analysis to continue while the remote histogram is being filled. In Dask workflows, this also eliminates the need to reduce each worker's output histogram, a step that can otherwise lead to unexpected memory spikes during accumulation. Finally, alternative histogram implementations can be served that, for example, fill directly on disk, thereby effectively eliminating scaling limitations on histogram sizes.
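The pattern of a single central histogram with non-blocking fills can be sketched with the Python standard library alone. This is an in-process stand-in under stated assumptions, not the implementation presented in the talk: the `HistogramServer` class, its `fill_async` and `wait` methods, and the `worker` function are hypothetical names, a thread and a queue take the place of the network transport, and a simple one-dimensional histogram with fixed binning is used.

```python
import queue
import threading
from collections import Counter


class HistogramServer:
    """In-process stand-in for a remote histogram server (hypothetical API)."""

    def __init__(self):
        self.counts = Counter()  # the single central histogram: bin index -> count
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # Only this thread touches self.counts, so no lock is needed.
        while True:
            batch = self._requests.get()
            self.counts.update(batch)
            self._requests.task_done()

    def fill_async(self, bin_indices):
        """Enqueue a batch of bin indices and return immediately."""
        self._requests.put(list(bin_indices))

    def wait(self):
        """Block until every enqueued batch has been accumulated."""
        self._requests.join()


def worker(server, values, n_bins=10, lo=0.0, hi=1.0):
    # The worker only computes bin indices; it never holds a histogram copy.
    width = (hi - lo) / n_bins
    indices = [min(int((v - lo) / width), n_bins - 1) for v in values if lo <= v <= hi]
    server.fill_async(indices)  # non-blocking: the worker can move on right away


server = HistogramServer()
workers = [
    threading.Thread(target=worker, args=(server, [0.05, 0.15, 0.15, 0.95]))
    for _ in range(4)
]
for t in workers:
    t.start()
for t in workers:
    t.join()
server.wait()  # the final histogram is complete only after all batches are processed
```

Because the workers send fill requests instead of returning histograms, there is no per-worker copy to reduce at the end. In a real deployment the queue would be replaced by a network protocol, which is outside the scope of this sketch.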
We will introduce the concept of Histogramming as a Service, discuss its implementation design, and present large-scale benchmarks measured at the coffea-casa computing infrastructure.