Madgraph5 GPU development
513/R-070 - Openlab Space (CERN)
Host: Stefan Roiser
# Madgraph on GPU dev meeting, Tue 07.02.2023

Present: SR, ZW, JT, SH, AV, OM, TC, NN, CV, WH

## Round table

NN: shows some scaling plots.
(AV: please upload them to indico)
- CPU is on a Skylake 8180.
There are curves with vectorization, with 2, 4, 8, 16 doubles per vector.
(AV: one thing to check is why SYCL increases as nevt increases (in ST),
while CUDA is flat... naively one would think SYCL should be flat too).
In ST it seems SYCL is outperforming CUDA (only at high nevt).
When going MT (using the recent draft OMP from AV in cudacpp),
SYCL is still a bit better than CUDA, but by a smaller margin than in ST (AV: surprising),
and CUDA becomes much closer as you add more gluons (tested up to ggttgg).
(AV: would be nice to also see the plots for ggttggg next time).
- GPU is an A100.
SYCL looks better than CUDA in ggtt and ggttg, but worse in ggttgg.
- About vectorization:
AV: can you confirm the implementation again uses fptype_v, with the typedef based on
the SYCL vector type rather than the gcc compiler vector extension?
NN: yes, essentially it is like that; SYCL uses a native vector type if it can find one,
but otherwise it emulates it.
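
(For reference, a minimal illustrative sketch of the two typedef flavours discussed above,
assuming double precision and a vector width of 4; the names follow the fptype/fptype_v/neppV
convention, but the exact definitions are not copied from either codebase.)

```cpp
#include <sycl/sycl.hpp> // only needed for the SYCL flavour

typedef double fptype;   // assumed: double precision
constexpr int neppV = 4; // assumed: 4 doubles per vector

// cudacpp flavour: gcc/clang compiler vector extension
typedef fptype fptype_v_gcc __attribute__( ( vector_size( neppV * sizeof( fptype ) ) ) );

// SYCL flavour: sycl::vec, which maps to a native vector type where the backend
// provides one and is emulated otherwise (as NN describes above)
typedef sycl::vec<fptype, neppV> fptype_v_sycl;
```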

TC: ntr

WH: was very busy, but did some tests of SUSY, trying to reproduce the same distributions.

OM1: merged some of Andrea's MRs

OM2: also did some work on trying to clarify the parallelism level in madevent;
this might require some changes in the bridge eventually.
Essentially we should not have thousands of events with the same channelid.
Madevent should instead provide many warps of 32 events, where each warp has the same channelid,
but different warps can have different channelids.
The bridge API should then get not a scalar channelid, but one channelid per warp.
SH: modern GPUs can digest some thread divergence, we can try that.
AV: yes we can try that on GPU, but it would break CPU SIMD.
So please give warps of events where the channelid is the same within each warp!
To rephrase: on GPU we can digest some thread divergence, but since we
can easily design for no thread divergence, let's have none!
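
(A hypothetical sketch of the bridge API change being discussed: pass one channelid per
warp of 32 events instead of a single scalar; the names and types below are illustrative,
not the actual cudacpp Bridge interface.)

```cpp
#include <vector>

constexpr int eventsPerWarp = 32; // assumed warp size: same channelid within each warp

struct BridgeSketch
{
  // current style (simplified): a single channelid for the whole batch of events
  void computeMEs( const double* momenta, unsigned int channelId, double* mes );

  // proposed style (simplified): one channelid per warp, where warp w covers
  // events [ w * eventsPerWarp, ( w + 1 ) * eventsPerWarp )
  void computeMEs( const double* momenta, const std::vector<unsigned int>& channelIdPerWarp, double* mes );
};
```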

OM3: also working on a very high-priority gridpack issue in CMS:
new generation with WZ at NLO, previously only at LO.

OM4: got a comment from StefanoF that madgraph4gpu sounds like Madgraph4;
we should keep that in mind...
SR: however, would not change the name.

CV: still waiting for some code to run performance tests on with T4 GPUs.
CV: also confirms that the gridpack issue in CMS is really a big issue!

AV1: thanks to OM for merging my upstream MRs!
AV2: mainly preparing the CAF seminar of tomorrow.
Apologies, I am very late with the slides and will probably not be able to circulate them in advance.
Will focus on cudacpp, giving only a few points on SYCL (including Jorgen's plots from today,
and mentioning Nathan's work on SYCL).
AV3: received a request for the ACAT paper by the beginning of March, not sure what to do.
We had the ICHEP paper published the day before ACAT started...

JT: shows some plots for CUDA vs SYCL on an A100.
Seems consistent with what Nathan has shown: SYCL better in eemumu, ggtt, ggttg,
but CUDA better in ggttgg and ggttggg.

ZW: spent the last two weeks writing an XML parser to remove the boost dependency.

SR: preparing proposals.
SR: there is an openlab workshop; asked AV to give a presentation there.

## Unweighting on GPU (SH)

Work triggered by discussions with OM during the hackathon.
Once you speed up the ME on the GPU, it becomes only about 1% of the total time.
The flamegraph shows interesting structure in the remaining 99%.

One idea could be to do the unweighting on the GPU.
A first step could be to compute the maximum weight.
Tried it on the GPU, very fast. Would it be useful to move it out of Fortran?
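
(A hypothetical sketch of that first step, finding the maximum weight of an event batch
with a device-side reduction; this is written with SYCL 2020 reductions for illustration
and is not SH's actual implementation.)

```cpp
#include <sycl/sycl.hpp>
#include <limits>
#include <vector>

// Find the maximum weight of a batch of events on the device.
double maxWeightOnDevice( sycl::queue& q, const std::vector<double>& weights )
{
  double result = std::numeric_limits<double>::lowest();
  {
    sycl::buffer<double, 1> wBuf( weights.data(), sycl::range<1>( weights.size() ) );
    sycl::buffer<double, 1> rBuf( &result, sycl::range<1>( 1 ) );
    q.submit( [&]( sycl::handler& h ) {
      auto wAcc = wBuf.get_access<sycl::access_mode::read>( h );
      auto maxRed = sycl::reduction( rBuf, h, sycl::maximum<double>() );
      h.parallel_for( sycl::range<1>( weights.size() ), maxRed,
                      [=]( sycl::id<1> i, auto& maxVal ) { maxVal.combine( wAcc[i] ); } );
    } );
  } // buffer destructors wait for the kernel and copy the result back to the host
  return result;
}
```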

AV: where is it in the flamegraph?
SH: it is in unweight and write_les_houches.
OM: there is also some time in store_event/read_event;
a lot of writing to file is done there, which could be removed.

AV: could also try to port x_to_f_rg (i.e. gen_mom) to GPU;
this should be the equivalent of rambo (input random numbers, output momenta).

OM: yes, and also setscales, which computes the event-by-event scale G
(which is then passed to the GPU, where the couplings are computed from G).
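
(A hypothetical sketch of the two porting candidates just mentioned, expressed as batched
interfaces; the function names, array shapes and the ndim parameter are illustrative and
are not the actual madevent/cudacpp signatures.)

```cpp
#include <cstddef>

// gen_mom-like step: random numbers in, four-momenta (and phase-space weights) out,
// for a batch of nevt events with npar external particles and ndim random dimensions.
void genMomentaBatch( std::size_t nevt, std::size_t npar, std::size_t ndim,
                      const double* randoms, // [nevt * ndim] random numbers in (0,1)
                      double* momenta,       // [nevt * npar * 4] output four-momenta
                      double* wgts );        // [nevt] phase-space weights

// setscales-like step: compute the event-by-event scale G, which is then passed to
// the GPU where the running couplings are computed from it.
void setScalesBatch( std::size_t nevt, std::size_t npar,
                     const double* momenta, // [nevt * npar * 4]
                     double* gStrong );     // [nevt] event-by-event scale G
```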

AV: would also like to run madevent on many CPU cores, all sharing a single GPU.
SH: this is a good idea in general, but it might not work here:
if what madevent is doing is writing to /tmp, then many cores will not go faster.

## AOB

SR: will be away in two weeks, but you can go ahead anyway.
OM: would prefer to keep Tuesday.
SH: has a meeting at 4pm on Tuesday.
SR: OK for Tuesday 21 then, 3pm to 4pm.

 
