What about JAX instead of CuPy? +3
Is there a way to check if there is any race condition? +3
The very first thing you did, selecting a GPU kernel: how do you do this outside a Jupyter notebook? How would you do it from a plain Python .py file? +4 -2
Is there any significant advantage to using Numba instead of PyCUDA? +2
Can that overflow bin (to the right of the end of the histogram) be omitted? +2
Is pandas compatible with CuPy?
Comment: @joosep pata, there is a searchsorted implementation in Thrust; if you expose it, e.g. via ctypes, you have it readily available.