# Wed 21.09.2022 Hackathon chat

Present: SR, OM, AV (notes), SH, ZW, Vincent Maillou (VM)

Discussion on compilers
- SR made a script for raplab to choose between gfortran 8 and 9
- SH used that on raplab and also tried pizdaint, will send some results when the tests are completed
- AV iterating on nvc++ on raplab, fixing linking errors, also installed ccache locally

Discussion on OM's modifications to sigmakin
- Presently sigmakin is one kernel call: it iterates over helicities and calls calculate_wavefunction for each helicity
- As a first step, OM is making the calculate_wavefunction call for each helicity a separate kernel
 > AV ok so old model for memory access was each thread asks "which event am I?", new model is "which event and helicity am I?"
 > OM we now need to do the sum over all helicities: reduce across threads, using shared memory?
   VM/SH ideas on how to use shared memory to do this reduction
   AV calculate_wavefunction presently does a "+="... SH not a good idea to do an atomic += on global memory
 > discussion on how to spread these amongst threads and blocks... 
   SH/OM keep all helicities of an event in one block if we use shared memory (shared only by threads within a block)
   AV now we have coalesced memory access with different events in the same warp
   OM even better to have different helicities in the same warp, so you access less memory (use shared memory?)
   consensus that we should keep different options open (one such layout is sketched after this list)
   AV there is a max of 1024 threads per block... SH/OM we can do a loop if there are more than 1024 helicities
 > to be understood: are the memory access functions ok as they are?
   [AV a posteriori: we need some changes anyway to kernelAccess functions, one kernel becomes one event and one helicity]
- AV suggests doing the rest in steps
 > Second step is splitting calculate_wavefunction into two kernels: one with all the FFVs, and one with just the color algebra
   This may also require some changes to the access functions, but not many (only the jamps need to go to global memory)
 > Third step is the splitting of the FFVs (and the jamp+=) into different, very small kernels
   Discussion on FFV and jamp+=, AV suggests writing wrappers (one wrapper is one kernel, which does one FFV and also one jamp+=)
   Discussion on thread divergence, AV now it will no longer be 100%, how do we measure it? VM suggests asking the profiler channel (*)
 > SR cuda graphs? AV suggests this as a fourth step... the previous steps can also be done without cuda graphs
   After all, calculate_wavefunction is already effectively a graph: it calls O(1000) FFVs in a given order
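
As a concrete illustration of the "which event and helicity am I?" model and of the shared-memory reduction discussed above, here is a minimal sketch. It is not the actual cudacpp code: the names sigmaKinAllHel and calcAmp2ForHelicity are made up, and this particular layout (the 32 helicities of one event in one warp, 8 events per block) is only one of the options that were kept open.

```cuda
#include <cuda_runtime.h>

constexpr int NHEL = 32;          // helicity combinations per event (power of two assumed here)
constexpr int NEVT_PER_BLOCK = 8; // 8 events x 32 helicities = 256 threads per block (max is 1024)

// dummy stand-in for the real per-helicity matrix element contribution of one event
__device__ double calcAmp2ForHelicity( int ievt, int ihel )
{
  return 1.0 / ( 1 + ievt + ihel ); // placeholder computation
}

__global__ void sigmaKinAllHel( double* meSquared, int nevt )
{
  const int ihel = threadIdx.x;                            // "which helicity am I?"
  const int ievt = blockIdx.x * blockDim.y + threadIdx.y;  // "which event am I?"
  const int tid = threadIdx.y * blockDim.x + threadIdx.x;  // flat thread index within the block
  extern __shared__ double sum[];                          // one partial sum per thread of this block
  sum[tid] = ( ievt < nevt ) ? calcAmp2ForHelicity( ievt, ihel ) : 0.;
  __syncthreads();
  // tree reduction over the NHEL helicities of each event, entirely within the block
  for( int stride = NHEL / 2; stride > 0; stride /= 2 )
  {
    if( ihel < stride ) sum[tid] += sum[tid + stride];
    __syncthreads();
  }
  if( ihel == 0 && ievt < nevt ) meSquared[ievt] = sum[tid]; // no atomic += on global memory needed
}

int main()
{
  const int nevt = 8192; // 8k events, as in the current grid size
  double* d_me2;
  cudaMalloc( &d_me2, nevt * sizeof( double ) );
  const dim3 block( NHEL, NEVT_PER_BLOCK );
  const int nblocks = ( nevt + NEVT_PER_BLOCK - 1 ) / NEVT_PER_BLOCK;
  const size_t shmem = NHEL * NEVT_PER_BLOCK * sizeof( double );
  sigmaKinAllHel<<<nblocks, block, shmem>>>( d_me2, nevt );
  cudaDeviceSynchronize();
  cudaFree( d_me2 );
  return 0;
}
```

(With blockDim.x = NHEL, one warp holds the 32 helicities of one event; swapping the x and y thread dimensions would instead give coalesced access with different events in the same warp, the other option mentioned above.)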

AV we should introduce some timing of the color algebra in cudacpp (see the sketch after this paragraph)
OM shows a draft of a new paper with Lifson, reducing the color algebra timing (in fortran) by x2 or more,
using an algebra of permutations (reduce calculations by reusing permutations of previously computed results).
Example: in gg to 5g the color algebra timing is 72%(!), the new method will bring it down to 44%.
Then it still makes sense to work on reducing the other 50% (e.g. splitting the FFV kernels individually).
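
On the cudacpp timing point above, a minimal sketch of how cuda event timers could be wrapped around the color algebra, assuming it has been split into its own kernel as in the second step discussed earlier; colorAlgebraKernel and the launch configuration are hypothetical placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// hypothetical placeholder for a standalone color algebra kernel (color sum over the jamps)
__global__ void colorAlgebraKernel() {}

int main()
{
  cudaEvent_t start, stop;
  cudaEventCreate( &start );
  cudaEventCreate( &stop );
  cudaEventRecord( start );
  colorAlgebraKernel<<<64, 256>>>(); // hypothetical grid/block sizes
  cudaEventRecord( stop );
  cudaEventSynchronize( stop );
  float ms = 0;
  cudaEventElapsedTime( &ms, start, stop );
  printf( "color algebra: %.3f ms\n", ms );
  cudaEventDestroy( start );
  cudaEventDestroy( stop );
  return 0;
}
```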

AV shows slides from https://indico.cern.ch/event/1170924/contributions/4954511/
Slide 22: we can keep the same GPU throughput over several CPU cores
There is even a strange increase, many CPU cores seem better than only one core?! Not clear why... VM can profile that! (**)

Brainstorming for future developments, summary of a few directions
- Take slide 21 of AV's talk above to make some points
 Production was ~1260s = 60s (mad/fortran) + 1200s (ME/fortran)
 With GPU offload of the ME it becomes ~72s = 60s (mad/fortran) + 12s (ME/cuda)
 To speed up the overall timing, we now need to speed up the madevent 60s on CPU
 Note that the 12s on cuda uses an 8k GPU grid size, which is suboptimal, but it is limited to 8k by madevent fortran memory
 (if we could use 16k grid sizes the 12s on cuda would maybe already go down to 6s)
 [Example of fortran memory https://github.com/madgraph5/madgraph4gpu/blob/master/epochX/cudacpp/gg_ttgg.mad/SubProcesses/cluster.inc]
- 1. AV to reduce the madevent 60s on CPU, we can use several CPU cores
 (unless OM can find ways to speed up the fortran; SR suggests: can we offload more to the GPU?)
 See slides 22 and 24: can keep the GPU busy also by splitting across CPU cores, small overhead
 For instance if we use 64 cores and reduce that to 1s, there is still interest in speeding up the 12s on GPU!
 But using many processes in parallel we may have even more problems with fortran CPU memory...
- 2. OM an alternative to reduce the CPU memory in fortran per event is to use fewer events!
 If we go to the model where each GPU thread is one helicity of one event, rather than all helicities of one event,
 we reduce the total number of events in flight (and the fortran CPU memory) by the number of helicities
 Example (ggttgg?) with 32 helicities, instead of 8k events we can use 250 events times 32 helicities.
 (And maybe we can even push to 500 events with 32 helicities, i.e. a more optimal grid size of 16k; see the arithmetic sketch after this list).
- 3. There is still a lot of interest in further speeding up the ME calculation in cuda then!
 SR tensor cores to speed up the color algebra
 OM/AV to split kernels, eventually to individual FFVs (and SR eventually to cuda graph, maybe)
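
A quick arithmetic sketch of point 2 above (numbers taken from the discussion, with the ggttgg example and 32 helicities assumed), showing how the one-thread-per-(event, helicity) model keeps the ~8k grid size with far fewer events in flight, and how 500 events would give the more optimal 16k grid.

```cuda
#include <cstdio>

int main()
{
  const int nhel = 32;       // helicity combinations per event (ggttgg example)
  const int gridSize = 8192; // ~8k GPU threads, the current madevent-imposed limit
  const int eventsInFlight = gridSize / nhel; // ~250 events in flight instead of ~8k
  printf( "events in flight: %d (was %d), fortran CPU memory reduced ~x%d\n",
          eventsInFlight, gridSize, nhel );
  printf( "500 events x %d helicities -> grid size %d (~16k)\n", nhel, 500 * nhel );
  return 0;
}
```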

(*) AV a posteriori asked this on the profiler channel:
"We are using sm__sass_average_branch_targets_threads_uniform.pct currently to measure lockstep processing in our kernels. This is fine now because we have no thread divergence at all, so this pct is always 100%. However we will probably move to a model with smaller kernels, so there will be some amount of thread divergence: we expect this will be extremely limited, with almost invisible effect on overall performance, but it would still be good to quantify that. Is the metric above the best one, or can you suggest a better one? Suppose I have a kernel which takes 100s because all is in lockstep. Suppose now I have two branches, but they are limited to 1% ie 1s, and suppose they still continue to take 1s. I imagine (correct?) the total time will now be 101s (99s all in sync, plus 1s for one branch with some threads waiting, plus 1s for the other branch with the other threads waiting). I would like a metric that gives me 100s/101s ie 99% (more or less the average computation time used, divided by the average wall time).. is the metric above providing me this essentially? Thanks!"

(**) AV instructions to reproduce the plots
The driver of tests is https://github.com/madgraph5/madgraph4gpu/blob/master/tools/benchmarking/gpudriver.sh
Everything runs in a singularity container; you get the help from:
singularity run -B <localresultsdir>:/results oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 -h
The script for the plots is https://github.com/madgraph5/madgraph4gpu/blob/master/tools/benchmarking/bmkplots.py
This means you would need to run "<profiler> singularity <image> <options>", I hope that would work?
Otherwise one must reengineer these scripts and inject the profiler there, which is more complex:
https://github.com/madgraph5/madgraph4gpu/blob/master/epochX/cudacpp/tput/throughputX.sh
 
