175th ROOT Parallelism, Performance and Programming Model Meeting
PPP 23.01.2025
Thread-safe histograms in RDF
[Jakob]: These are global queues?
[Stephan]: No, two queues per histogram, and a lock.
[Jakob]: What if flushing takes longer than filling the other queue, i.e. if both queues end up full?
[Philippe]: Same algorithm as the TBufferMerger. There is an equation: if the rate at which the workers generate work exceeds the rate at which it can be flushed, the queue grows indefinitely. The one thread that does the flushing cannot keep up with the amount of work being generated by the other 256 threads.
[Jonas H]: I showed that this kind of queueing does not work for the TBufferMerger.
[Stephan]: The queues do not grow indefinitely, they have a fixed size. There is a conditional busy wait, depending on what is in the computation graph.
[Danilo]: The mechanism you are showing with queues is generic, can work with any wrapped histogram?
[Stephan]: Yes, the thing writes down the arguments of Fill.
[Danilo]: Question for Jonas H.: you formalised the statement that a queue in an MT program is either always empty or always full, so it is not necessary?
[Jonas H.]: Either you don't use it, or you are always filling it and not fast enough to consume it.
[Vincenzo]: I would have expected the case of 1 TH3D with MT to use more memory than the plot on slide 8 shows.
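For reference, a minimal sketch of the double-buffered fill idea as described above (fixed-size queues per histogram plus a lock, recording the arguments of Fill and draining them into the shared histogram). All names are illustrative, not the actual prototype; in particular, the second buffer is a local vector here for brevity, and the real mechanism presumably overlaps producing and flushing more aggressively.

```cpp
#include <mutex>
#include <vector>

template <typename Hist>
class FillQueue {
   struct Args { double x, y, z, w; };        // arguments of a TH3-style Fill
   static constexpr std::size_t kQueueSize = 1024; // fixed queue size
   std::vector<Args> fActive;                 // buffer currently being filled
   Hist &fHist;                               // the single shared histogram
   std::mutex fQueueMutex;                    // protects the active buffer
   std::mutex fFlushMutex;                    // protects the shared histogram

public:
   explicit FillQueue(Hist &h) : fHist(h) { fActive.reserve(kQueueSize); }

   // Called concurrently by worker threads.
   void Fill(double x, double y, double z, double w = 1.)
   {
      std::vector<Args> toFlush;
      {
         std::lock_guard<std::mutex> lk(fQueueMutex);
         fActive.push_back({x, y, z, w});
         if (fActive.size() < kQueueSize)
            return;
         // The buffer is full: take it over for flushing and start a fresh one,
         // so other workers can keep appending while we drain.
         toFlush = std::move(fActive);
         fActive.clear();
         fActive.reserve(kQueueSize);
      }
      std::lock_guard<std::mutex> lk(fFlushMutex);
      for (const auto &a : toFlush)
         fHist.Fill(a.x, a.y, a.z, a.w);
   }
};
```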
[Jakob]: On slide 10, I can see that the blue lines grow in lockstep.
[Stephan]: In this case there is one histogram per physical core on the machine.
[Philippe]: So the speedup is there because every histogram fits in the local cache of the thread?
[Stephan]: No, because the histograms are visited by every thread.
[Danilo]: But there is one slot per histogram, so it's filled with a range of events.
[Jonas H.]: How much memory does the machine have?
[Stephan]: 32 GB
[Jonas H.]: So it's swapping, didn't you say 80 GB on the previous slide?
[Stephan]: No it's 80 MB per histogram.
[Josh Bendavid]: For context, I guess people are aware of what boost histogram did, with atomics. That has other implications, but in terms of future strategies, I think it would be very useful to have a direct comparison between locks, queues and just using atomics. Maybe one way to do that would be to have a trivial struct with a vector.
[Jonas H.]: Yes I did this test and the conclusion was that locks never make sense over atomics.
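As a reference for the "trivial struct with a vector" that Josh mentions, a hedged sketch of lock-free filling with one atomic per bin (not ROOT code; names and the 1D binning are illustrative). The portable CAS loop is used because atomic fetch-add on double is only standard since C++20.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

struct AtomicHist1D {
   double fXMin, fXMax;
   std::vector<std::atomic<double>> fBins; // one atomic bin counter each

   AtomicHist1D(std::size_t nBins, double xmin, double xmax)
      : fXMin(xmin), fXMax(xmax), fBins(nBins) {}

   // Safe to call concurrently from many threads without a lock.
   void Fill(double x, double w = 1.)
   {
      const auto nBins = fBins.size();
      const auto bin =
         static_cast<std::size_t>((x - fXMin) / (fXMax - fXMin) * nBins);
      if (bin >= nBins)
         return; // under/overflow ignored in this sketch
      auto &b = fBins[bin];
      double old = b.load(std::memory_order_relaxed);
      // CAS loop: retry until the addition of the weight succeeds atomically.
      while (!b.compare_exchange_weak(old, old + w, std::memory_order_relaxed)) {
      }
   }
};
```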
[Stephan]: Since you mentioned boost histograms: in the long-term vision, ROOT would have a histogram that uses atomics and runs in RDF by default. This is an intermediate solution while the ROOT 7 histogram is being developed. Until then, people should still be able to run their RDF analysis without running out of memory.
[Josh]: So you're basically already assuming that in an ideal world the atomics would do a better job. Then my other comment is that it's at least worth testing how these behave with a much larger number of threads; that is starting to be relevant. In those cases you might want the ability to have some intermediate behaviour, with one copy of the histogram every N threads (8, 16, to be seen).
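A rough sketch of the "one histogram copy every N threads" idea Josh raises: worker slots are grouped, each group shares one clone protected by its own lock, and the clones would be merged at the end. Names are hypothetical, and the histogram type is assumed to be copyable (as TH3D is); this is not an existing RDF interface.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

template <typename Hist>
struct GroupedClones {
   std::vector<std::unique_ptr<Hist>> fClones; // one clone per group of slots
   std::vector<std::mutex> fMutexes;           // one lock per clone
   unsigned fGroupSize;

   GroupedClones(const Hist &model, unsigned nSlots, unsigned groupSize)
      : fMutexes((nSlots + groupSize - 1) / groupSize), fGroupSize(groupSize)
   {
      for (std::size_t i = 0; i < fMutexes.size(); ++i)
         fClones.emplace_back(std::make_unique<Hist>(model));
   }

   // Slots 0..N-1 share clone 0, slots N..2N-1 share clone 1, and so on.
   void Fill(unsigned slot, double x, double y, double z, double w = 1.)
   {
      const auto group = slot / fGroupSize;
      std::lock_guard<std::mutex> lk(fMutexes[group]);
      fClones[group]->Fill(x, y, z, w);
   }
   // Merging the clones into a single result would happen after the event loop.
};
```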
[Philippe]: I agree with Josh, and another comment. The problem now is running out of memory. You can calculate that.
[Stephan]: Easy to do that for a single histogram, more difficult to do that on the whole computation graph.
[Philippe]: If the actual solution is using atomics, i.e. new histograms, there is no need to spend time on making the interface work with the "auto" option. At the same time we need to communicate to the user when this is needed.
[Jonas H.]: Even then we will have another problem. My conclusion so far is that you cannot decide what to do from the size of a single histogram. The really interesting case is when you have many histograms, and then you need to compute the total working size. For large histograms you cannot beat the atomics: there are so many histos and they're so large that the chances of running into contention are slim. The "auto" option is really tricky, regardless of the implementation.
[Danilo]: Whatever heuristics we invent, we will never be able to make everybody happy. We can give the option and then physicists will decide.
[Josh]: I think it's true that the graph can figure out what the total working size is in the end. I think what was suggested before, that we get a warning, is useful. We should know, as users, that our histograms could potentially make us run out of memory.
[Jakob]: I'm not thrilled about the option in the histogram model. It tells you something about the histo itself, not how it's being filled.
[Giacomo]: Can it be an action for the computation graph?
[Josh]: Another option is a pythonization of the histo call.
[Vincenzo]: I personally like adding a struct as optional argument to the Histo* methods.
[Philippe]: I like Giacomo's idea.
[Giacomo]: But that would clash with the "auto" option.
[Jonas H.]: Is there a reason why, under these circumstances, one would not clone a Histo3D? Imagine you have the computation graph defined. You cannot run it in the current way because you run out of memory, so you have to go through this feature. So why would you make it an option of every histogram, so that you can say "I clone this histogram, but not that one"?
[Giacomo]: Indeed, and I don't like my previous idea that much anymore; it gives the user too much granularity. This is probably better as a separate global option to say "do not clone histograms for this entire computation graph".
[Danilo]: We know that in the end, if I understand the discussion so far correctly, atomics for the bins are the best implementation option. Given that everything here is internal, why don't we skip the system with the queues and go directly to something that is an array of atomics, and then construct a TH3D from it internally? Everything would happen in RDF anyway, and we would transparently switch to the new histograms when they're ready. I'm also thinking about the work that Jonas H. presented; I think that work had statistics.
[Jonas H.]: No, they did not have that. And I also believe that they do not belong in a histogram data structure.
[Jonas H.]: What I see is that the thing that costs memory is the bin storage. The number of entries can be reconstructed from all the bins, and so can the number of effective entries (which takes the weights into account).
[Stephan]: You need both the weighted and non-weighted entries.
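For reference, ROOT defines the number of effective entries as (sum of weights)^2 / (sum of squared weights). Below is a minimal sketch, assuming hypothetical per-bin sum-of-weights and sum-of-squared-weights arrays, of recomputing the global sums after the event loop; note that the raw, unweighted number of Fill calls is not recoverable from these sums alone, which is the case Stephan points to.

```cpp
#include <vector>

struct HistStats {
   double sumw = 0.;  // sum of weights
   double sumw2 = 0.; // sum of squared weights
   // ROOT's effective-entries definition: (sum w)^2 / (sum w^2)
   double EffectiveEntries() const { return sumw2 > 0. ? sumw * sumw / sumw2 : 0.; }
};

// Recompute global statistics once, after filling, from the per-bin arrays.
HistStats ComputeStats(const std::vector<double> &binSumW,
                       const std::vector<double> &binSumW2)
{
   HistStats s;
   for (double w : binSumW)
      s.sumw += w;
   for (double w2 : binSumW2)
      s.sumw2 += w2;
   return s;
}
```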
[Danilo]: In RDF we can discuss all of this because it's happening internally.
[Jonas H.]: Danilo's proposal is basically implementing a new histogram inside of RDF.
[Lorenzo]: You can always go back to the old one if needed.
[Jonas H.]: But the user may want some information that this internal histogram does not have by default, so that needs to be reimplemented.
[Philippe]: Coming back to Danilo's idea, what about just copy-pasting the implementation of the current action (Histo3D) to a new internal action Histo3DAtomics and use the atomics then?
[Jonas H.]: You will get contention on the statistics.
[Lorenzo]: Compute afterwards.
[Stephan]: You can't for all of them, e.g. the number of fills.
[Lorenzo]: You have the sum of weights.
[Stephan]: You can have a file with systematics where the number of entries is 1K and the sum of weights is 50; the histogram needs to support all cases.
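A hedged sketch of side-stepping the contention on the statistics raised above: the bins would live in shared atomic storage, while the per-fill counters that cannot be recomputed from the bins alone (number of fills, sum of weights, sum of squared weights) are accumulated per worker slot and reduced once after the event loop. Hypothetical names, not a real Histo3DAtomics action.

```cpp
#include <cstdint>
#include <vector>

struct SlotStats {
   std::uint64_t fEntries = 0; // raw number of Fill calls (unweighted)
   double fSumW = 0.;          // sum of weights
   double fSumW2 = 0.;         // sum of squared weights
};

// Each worker updates only its own slot, so no synchronization is needed here.
inline void RecordFill(SlotStats &slot, double w)
{
   ++slot.fEntries;
   slot.fSumW += w;
   slot.fSumW2 += w * w;
}

// Single-threaded reduction after the event loop.
inline SlotStats MergeStats(const std::vector<SlotStats> &slots)
{
   SlotStats total;
   for (const auto &s : slots) {
      total.fEntries += s.fEntries;
      total.fSumW += s.fSumW;
      total.fSumW2 += s.fSumW2;
   }
   return total;
}
```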
[Vincenzo]: We have to think about the direction of the API; doing it for TH3D is great performance-wise, but we cannot introduce it just for this one operation.
[Danilo]: Agreed, we need to better understand the direction.
[Stephan]: Let me reassess with the atomics in the TH3D Fill.