iminuit: part of scikit-hep, offering a thin Python layer around Minuit2.
numba-stats: auto-vectorized implementations of pdfs available in Python, numba-friendly.
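For orientation, a minimal sketch (not from the talk) of how the two pieces are typically combined: iminuit drives Minuit2, numba-stats supplies a vectorized pdf. The toy Gaussian data and the parameter names mu/sigma are purely illustrative.

    import numpy as np
    from iminuit import Minuit
    from iminuit.cost import UnbinnedNLL
    from numba_stats import norm  # numba-friendly, vectorized pdf

    rng = np.random.default_rng(1)
    data = rng.normal(1.0, 2.0, size=100_000)  # toy data

    def model(x, mu, sigma):
        # numba-stats pdfs are vectorized over x: norm.pdf(x, loc, scale)
        return norm.pdf(x, mu, sigma)

    cost = UnbinnedNLL(data, model)      # unbinned negative log-likelihood
    m = Minuit(cost, mu=0.0, sigma=1.0)  # parameter names taken from model's signature
    m.limits["sigma"] = (0, None)
    m.migrad()
    print(m.values, m.errors)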
Comments (Jonas R.): numCPU affects only the legacy backend; the parallelize option is related to efforts by ATLAS people. For the new CPU backend, approaches to parallelization have so far not been successful, so it is interesting that iminuit+numba can take advantage of it with lots of data points.
Q (Vincenzo): The right side of the plot indeed puts iminuit+numba+parallelization a few factors faster than RooFit's CPU backend, but for the majority of the plot iminuit is a factor of ten slower than RooFit. How much of an actual impact does the case of >=10^4 data points have?
A: (Hans) That becomes really relevant any time that the user experiences a "slow fit". That's where parallelization becomes important, so indeed it can have an impact.
Q (Danilo) What is the runtime of this benchmark?
A (Hans) not much, a couple of seconds in the worst case.
Q: (Danilo) what is the effect of fastmath?
A: (Hans) Very beneficial if available. It is about reassociation properties: we want the compiler to reorder instructions in such a way that it can parallelize best. If you don't turn it on, the order of computations (even if you say that a certain loop can be done in parallel) still has to follow a strict ordering. The fastmath option tells the compiler that it may reorder.
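As a concrete illustration of this point, a sketch (not code from the talk) of a hand-written Gaussian negative log-likelihood compiled with numba, where fastmath allows the reduction to be reordered and parallel splits the loop over threads:

    import math
    import numpy as np
    import numba as nb

    @nb.njit(fastmath=True, parallel=True)
    def nll(x, mu, sigma):
        # fastmath=True lets the compiler reassociate the running sum so it
        # can vectorize; parallel=True plus prange splits the loop over threads.
        total = 0.0
        for i in nb.prange(len(x)):
            z = (x[i] - mu) / sigma
            total += 0.5 * z * z + math.log(sigma) + 0.5 * math.log(2.0 * math.pi)
        return total

    x = np.random.default_rng(0).normal(0.0, 1.0, 1_000_000)
    print(nll(x, 0.0, 1.0))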
(Danilo) Making the burden of float ops lighter is a prerequisite for vectorizing. I was wondering if the speedup seen is due to a polynomial implementation of the logarithm? One could then take the VDT log and compare "apples to apples". Also, are the transcendental math functions changed when using numba?
(Hans) Yes.
(Danilo) Ok, if this is what numba does then it would be interesting to see it done in RooFit.
(Jonas R.) RooFit doesn't do it yet.
Q (Hans) Even for the CPU backend, do you use Kahan summation?
A (Jonas R.) Yes.
(Hans) I would expect RooFit to be faster, then.
(Jonas R.) There is a lot of caching in RooFit. With small datasets, doing these checks whether something needs to be recomputed is expensive. In some way, the performance of RooFit and numba depends on many different factors, so it's interesting to see they perform similarly.
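For readers unfamiliar with it, Kahan (compensated) summation looks roughly like this (a plain-Python sketch, not RooFit's implementation):

    def kahan_sum(values):
        # Compensated summation: `comp` tracks the low-order bits that a plain
        # running sum would lose when adding small terms to a large total.
        total = 0.0
        comp = 0.0
        for v in values:
            y = v - comp
            t = total + y
            comp = (t - total) - y
            total = t
        return total

Note that the compensation term is exactly the kind of floating-point ordering that fastmath-style reassociation is allowed to optimize away, so the two techniques trade accuracy and speed in opposite directions.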
Comment (Lorenzo) Codegen is a bit slower.
(Jonas R.) Yes, we see the effects of JITting and a lack of vectorization in the generated code.
Q (Jonas R.) For simple PDFs we can usually use SIMD instructions. For complicated PDFs, will it be just as easy?
A (Hans) For some of them. I like the Student's t distribution as a more flexible alternative to the Gaussian; with that one it should work. But it really depends on the user application.
Q (Jonas R.) Are you in touch with people that use numba-stats for more complicated analyses?
A (Hans) In our own papers for LHCb. I don't have a good overview of who's using it. What people like to use is the double-sided crystal ball; this can be implemented with exponentials and it should be parallelizable (an implementation is available in numba-stats).
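As an illustration of the "more flexible alternative to the Gaussian" mentioned above, a hedged sketch using the Student's t pdf from numba-stats in the same iminuit setup (the toy data and starting values are arbitrary choices):

    import numpy as np
    from iminuit import Minuit
    from iminuit.cost import UnbinnedNLL
    from numba_stats import t  # Student's t: t.pdf(x, df, loc, scale)

    rng = np.random.default_rng(2)
    data = 1.0 + 2.0 * rng.standard_t(df=5, size=50_000)  # toy heavy-tailed data

    def model(x, df, mu, sigma):
        return t.pdf(x, df, mu, sigma)

    cost = UnbinnedNLL(data, model)
    m = Minuit(cost, df=10.0, mu=0.0, sigma=1.0)
    m.limits["df"] = (1, None)
    m.limits["sigma"] = (0, None)
    m.migrad()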
Q (Jonas R.) You have nice tutorials in the iminuit documentation where you show a RooFit PDF used with iminuit. Do you think it would be useful if we supported easier export of RooFit PDFs to Python functors so they can be plugged directly into iminuit? RooFit for the model building, iminuit for the minimization.
A (Hans) I don't think it's that useful. You can use Minuit2 and get exactly the same. I don't think there's much iminuit can add once you have the RooFit PDF and models built. I looked into it because people were wondering if it's possible at all.
(Jonas R.) The way to retrieve results from the RooFit minimizer is not really Pythonic. You need to read some docs before you get the fit parameters into e.g. numpy arrays. So I think something might be gained there.
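To make the point concrete, a sketch of pulling fit parameters out of a RooFitResult into numpy (the toy Gaussian model is only there so the snippet runs end to end; recent PyROOT pythonizations, e.g. an iterable RooArgList, are assumed):

    import numpy as np
    import ROOT

    # Toy model, only so the snippet runs end to end.
    x = ROOT.RooRealVar("x", "x", -10, 10)
    mean = ROOT.RooRealVar("mean", "mean", 1, -10, 10)
    sigma = ROOT.RooRealVar("sigma", "sigma", 2, 0.1, 10)
    pdf = ROOT.RooGaussian("gauss", "gauss", x, mean, sigma)
    data = pdf.generate(ROOT.RooArgSet(x), 10000)

    fit_result = pdf.fitTo(data, ROOT.RooFit.Save(), ROOT.RooFit.PrintLevel(-1))

    # Manual conversion from the RooFitResult to numpy arrays.
    pars = fit_result.floatParsFinal()
    names = [p.GetName() for p in pars]
    values = np.array([p.getVal() for p in pars])
    errors = np.array([p.getError() for p in pars])
    print(dict(zip(names, values)), errors)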
Q (Jonas H.) Regarding codegen vs codegen/nograd, I assume this is about clad. Do the rest of the curves use numerical gradients?
A: The red curve is the only one that uses AD; it is much slower here because the model doesn't have many parameters. The constant cost factor of AD is roughly 7, while a numerical gradient needs about one extra function evaluation per parameter, so you only get benefits from AD if you have many parameters.
Comment (Lorenzo) For RooFit we should try to understand better this parallelization.
(Jonas R.) If you EnableImplicitMT it will parallelize, but you will probably need many events to see benefits. OpenMP had less overhead than TBB, but nothing clearly stood out as the best solution. At some point the code generation approach may be better, because I can put an OpenMP pragma around the whole for loop.
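For reference, the switch mentioned here is ROOT's global implicit multithreading; enabling it is a one-liner (the core count below is an arbitrary example):

    import ROOT

    # Turn on ROOT's global (TBB-backed) thread pool; RooFit's vectorizing CPU
    # backend can then evaluate the likelihood in parallel, provided there are
    # enough events to amortize the overhead.
    ROOT.EnableImplicitMT(4)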