Madgraph5 GPU development - NB Room 31/s-23

# Madgraph dev meeting Tue 23 Jan 2024

Present/remote: Stefan Roiser, Andrea Valassi (minutes), Carl Vuosalo, Zenny Wettersten, Nathan Nichols, Olivier Mattelaer (apologies, ~10 minutes late), Stephan Hageboeck (apologies, ~40 minutes late)

## Round table

### CV

CV shows the slides attached to the agenda for A100 tests.
Based on master as of approximately November 2023.
Best performance is at around 1100 registers.

SR which process?
CV ggttgg
SR normally 256 is the hardware limit?
AV yes exactly, the A100 spec is 255 registers per thread
CV we modify the max registers setting in the makefile
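
[Note: for reference, a generic sketch of the two usual ways to cap registers per thread when scanning for the best value; the flag value and the kernel below are illustrative assumptions, not the actual cudacpp makefile or sigmaKin.]

```
// Generic sketch (hypothetical kernel, not the cudacpp sigmaKin): two ways to
// cap the registers per thread while scanning for the best value.
//
// (1) Globally at compile time, via the nvcc flag passed from the makefile:
//       nvcc -maxrregcount=128 ...
// (2) Per kernel, via __launch_bounds__, from which the compiler derives a
//     register budget for this kernel only.

__global__ void __launch_bounds__(256) dummyKernel(float* out)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = 2.0f * i;
}

int main()
{
  float* d_out;
  cudaMalloc(&d_out, 256 * sizeof(float));
  dummyKernel<<<1, 256>>>(d_out);
  cudaDeviceSynchronize();
  cudaFree(d_out);
  return 0;
}
```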

AV actually you can check with Nsight how many are used, I have some scripts that dig that out.
The number is actually correct, because for complex processes it gives 255, but for eemumu it is around 100 for double and 50 for float.
NN had the exact same experience

AV look at this log 
https://github.com/madgraph5/madgraph4gpu/blob/master/epochX/cudacpp/tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
It contains "==PROF== Profiling "sigmaKin": launch__registers_per_thread 255"
It is produced with https://github.com/madgraph5/madgraph4gpu/blob/master/epochX/cudacpp/tput/teeThroughputX.sh
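
[Note: besides the Nsight Compute metric quoted in those logs, the same number can also be read programmatically through the CUDA runtime; a minimal sketch with a hypothetical kernel, not sigmaKin itself.]

```
// Minimal sketch (hypothetical kernel, not sigmaKin itself): query the number
// of registers per thread at run time with cudaFuncGetAttributes. Nsight
// Compute reports the same quantity as the launch__registers_per_thread
// metric quoted in the tput logs.

#include <cstdio>

__global__ void dummyKernel(double* out)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = 1.0 / (1.0 + i);
}

int main()
{
  cudaFuncAttributes attr;
  cudaFuncGetAttributes(&attr, dummyKernel);
  printf("registers per thread: %d\n", attr.numRegs);
  return 0;
}
```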

SR splitting kernels would help in that direction
NN doing some investigation in that direction, no results so far
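
[Note: as a generic illustration of that direction, with hypothetical kernels rather than the actual sigmaKin refactoring: splitting one large kernel into stages that pass intermediate results through global memory lets each stage be compiled within a smaller register budget.]

```
// Generic sketch of kernel splitting (hypothetical kernels, not the actual
// sigmaKin refactoring): one big kernel is broken into two stages that pass
// intermediate results through global memory, so that each stage can be
// compiled within a smaller register budget.

__global__ void stage1(const double* momenta, double* amplitudes, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) amplitudes[i] = momenta[i] * momenta[i]; // placeholder work
}

__global__ void stage2(const double* amplitudes, double* matrixElements, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) matrixElements[i] = amplitudes[i] + 1.0; // placeholder work
}

int main()
{
  const int n = 1024;
  double *d_mom, *d_amp, *d_me;
  cudaMalloc(&d_mom, n * sizeof(double));
  cudaMalloc(&d_amp, n * sizeof(double));
  cudaMalloc(&d_me, n * sizeof(double));
  cudaMemset(d_mom, 0, n * sizeof(double));
  stage1<<<n / 256, 256>>>(d_mom, d_amp, n);
  stage2<<<n / 256, 256>>>(d_amp, d_me, n);
  cudaDeviceSynchronize();
  cudaFree(d_mom); cudaFree(d_amp); cudaFree(d_me);
  return 0;
}
```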

OM maybe eemumu?
AV probably too simple
SR but we reach 255 in eemumu anyway, right?
AV no, we reach less, actually ~160/120 for double/float, not ~100/50, see below
"==PROF== Profiling "sigmaKin": launch__registers_per_thread 166" 
in https://github.com/madgraph5/madgraph4gpu/blob/master/epochX/cudacpp/tput/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
"==PROF== Profiling "sigmaKin": launch__registers_per_thread 117" 
in https://github.com/madgraph5/madgraph4gpu/blob/master/epochX/cudacpp/tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt

### NN

NN Made some minor performance improvements, gaining around 10%.

NN Did not do much else. Waiting for the HIP plugin, how is this coming along?
SR very good, we were just waiting for this
NN if HIP is there, it will give me a little guidance
SR can you remind me, did we say that it would be possible to launch the SYCL kernel as is done for CUDA and HIP?
NN yes, this is what we want to try

### ZW

Sitting in Milano with Marco Zaro trying to do the fixed order for NLO.
This is going well, we are making a lot of progress.
But it is still crashing when we vectorize with vector size > 1.
We should discuss the next trip.

OM one question, you say you are doing OpenMP, are you doing this for loops as well, or do you not care?
ZW we do not care about that for the moment

AV vectorize meaning cudacpp? or just fortran?
ZW no just vectorizing the fortran

AV eventually with cudacpp will you need SA or madevent or what?
OM for the reals it is based on SA, for the Born it will be madevent because you need multichannel (but it is more complex for the reduced Born), so you will need both
ZW anyway it is a separate driver, not all the LO
SR so what is needed in helamps is yet another python?
OM ixxx etc is exactly the same, but the way it is called will be different
OM at LO I had a double exporter, we might need that at NLO; it is not there yet

SR maybe you can present some slides next time?
ZW yes can do that

### OM

OM Not sure what I reported last time, but I have now split the warps so that each warp has its own choice (e.g. of channel).
This changes the cudacpp interface and I am working with SR on this; let SR discuss it.

### SR

SR working with OM on channelids so that they can be put in the bridge.
I have now put this in the bridge and am working through the cudacpp code. Adding a mask for channels.
One thing that is left is the calculation for the choice of random color at the end; I need to go through that.
I can prepare a couple of slides for the next meeting.
We now also have unsigned integers for memory access, not only floats/doubles.
But I had to stop coding at some point.
OM beware that you might have different leading colors depending on different channels of integration.
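
[Note: as an illustration only, with hypothetical kernel and array names rather than the actual bridge interface: giving each warp its own channel id amounts to indexing a per-warp channel array, instead of one channel value shared by the whole grid.]

```
// Illustration only (hypothetical kernel and array names, not the actual
// cudacpp bridge interface): with per-warp channel ids, each warp reads its
// own channel choice instead of the whole grid sharing a single channel,
// and any branching on the channel remains uniform within a warp.

__global__ void pickChannel(const unsigned int* channelIds, // one entry per warp
                            unsigned int* chosen)           // one entry per thread/event
{
  const int tid = blockIdx.x * blockDim.x + threadIdx.x;
  const int iwarp = tid / 32; // all 32 threads of a warp get the same id
  chosen[tid] = channelIds[iwarp];
}

int main()
{
  const int nThreads = 256, nWarps = nThreads / 32;
  unsigned int *d_channelIds, *d_chosen;
  cudaMalloc(&d_channelIds, nWarps * sizeof(unsigned int));
  cudaMalloc(&d_chosen, nThreads * sizeof(unsigned int));
  cudaMemset(d_channelIds, 0, nWarps * sizeof(unsigned int));
  pickChannel<<<1, nThreads>>>(d_channelIds, d_chosen);
  cudaDeviceSynchronize();
  cudaFree(d_channelIds);
  cudaFree(d_chosen);
  return 0;
}
```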

SR By the way, nothing is done in the code generator yet.
OM I can also help with pair programming for the code generator

### AV

First, I have done some work, but very inefficiently in the last 5 months, due to absences and other issues.
Also not completely sure for how long or in which way I will continue working on this project.

Second, AV shows slides on makefiles etc
[note by AV: these minutes are incomplete as I did not write down the full discussion I was taking part in]

SR one doubt, does 774 (GpuAbstraction) also need 775 (different builds for cuda and cpp targets)?
AV you may be right, did not have time to look at 774 in detail, because we had many discussions on 775 and I stopped there
AV checks in real time on the PR: yes you are right and I was wrong on this, 774 does include hipcc building, 
so strictly speaking Nathan could work on top of 774 even without 775.
That said, I would very much prefer to also have 775 if we have 774: otherwise we have builds that always include cpp and hip,
and it gets even messier when cuda libraries are present. And including sycl on top would be a mess.
NN agree with AV, would prefer to have 775 go in immediately, at the same time as 774
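
[Note: for context, a minimal sketch of the general idea behind such an abstraction layer, with illustrative macro names that are not the actual contents of 774: a single header maps generic gpu* calls onto either the CUDA or the HIP runtime at compile time.]

```
// Minimal sketch of a CUDA/HIP abstraction header (illustrative names only,
// not the actual GpuAbstraction of PR 774): the same source builds against
// either runtime depending on which compiler macro is defined.

#ifdef __HIPCC__
#include "hip/hip_runtime.h"
#define gpuMalloc hipMalloc
#define gpuFree hipFree
#define gpuMemcpy hipMemcpy
#define gpuMemcpyHostToDevice hipMemcpyHostToDevice
#define gpuDeviceSynchronize hipDeviceSynchronize
#else
#include <cuda_runtime.h>
#define gpuMalloc cudaMalloc
#define gpuFree cudaFree
#define gpuMemcpy cudaMemcpy
#define gpuMemcpyHostToDevice cudaMemcpyHostToDevice
#define gpuDeviceSynchronize cudaDeviceSynchronize
#endif

#include <cstdio>

int main()
{
  double* d_buf;
  double h_buf[4] = { 1., 2., 3., 4. };
  gpuMalloc(&d_buf, 4 * sizeof(double));
  gpuMemcpy(d_buf, h_buf, 4 * sizeof(double), gpuMemcpyHostToDevice);
  gpuDeviceSynchronize();
  gpuFree(d_buf);
  printf("ok\n");
  return 0;
}
```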

Some discussion on 753
SH [joined late during AV's presentation]: wanted to have the changes in 753 to do some profiling,
but changed plans and will not be profiling madgraph, so choose any option that makes the project advance faster.
AV just to make it clear, I would include some of the changes in 753, but not immediately (would add 775 first)

More discussion.
AV I vote for 774 first (with extra fixes), 775 later, and 753 only after these two... then everyone should say what they prefer, and someone will take a decision.
AV I am available to do the work on 774 if you want 
NN also vote for this, having 774 and also 775 

OM let's go for 774 first, and then rediscuss 775 and 753
OM please AV, go ahead with 774
AV I will do 774 (it would include a PR based on that, to do the rebasing and run the CI tests)
OM/SR ok for that
AV suggest LUMI for AMD?
OM/SR yes suggest LUMI

## AOB

SR was thinking of recording these meetings from next time onwards, any objection?
No objection to the proposal.

 
