Madgraph5 GPU development
# Madgraph dev meeting 03.10.2022
Present: SR, SH, AV, WH, JT, TC, NN, CV
Apologies: OM, ZW
# Round table
Last week SR, SH, AV, ZW, OM were at a hackathon in Lugano.
A lot of work by all of us with the help of mentors.
## SH
SH: profiled the Fortran code, then OM and ZW removed a large chunk of MLM code;
the Fortran is now a factor of two faster (and also uses less memory).
Also got some improvements from using shared memory in some cases.
SH: can bring the slides to the agenda and discuss them
TC: any discussion with the Nvidia experts about directly supporting the new C++ features?
SH: not really, but we tried the HPC kit and its compiler; it is more complicated than it looks.
SR/AV: yes many issues on different HPCs, especially the Nvidia cluster (and nvc++ there)
AV: we should maybe install nvc++ locally on our machines and add support for it
WH: has some experience installing nvc++ locally
TC: generally installs compilers locally, but does not normally use nvc++
## AV
At the hackathon mainly worked with Olivier and Zenny, so this report also covers the things they did:
- OM/ZW work on MLM also reduced Fortran memory, so we can run more events in the GPU grid
- worked with OM on splitting the big kernel (which does a helicity loop) into many kernels, one per helicity: 30% slower for simple processes, but a bit faster for complex processes (and, more importantly, it will allow us to run fewer events and fill the GPU with many helicities in parallel)
- also worked with OM on the memory structures for "jamp", to separate the Feynman diagram and color algebra steps (eventually we may even resume splitting the individual FFVs, but this is unlikely)
- OM had the great idea to run the color matrix in single precision, ZW did some implementation
- also discussions AV/OM/ZW on algorithmic changes to the color algebra: you do not gain a factor of two by separating (A+iB)M(A-iB) as AMA+BMB, but you can get a factor of two inside each AMA by exploiting the fact that M is symmetric (rewrite ai*mij*aj+aj*mji*ai as 2*ai*mij*aj)
- (for reference, note also OM's work with Andrew Lifson on additional improvements from symmetries and permutations in color algebra)
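
For the record, the symmetric-matrix point in the color algebra item above can be sketched as follows. This is only an illustration in plain Python, not the generated code: the function names, the 3x3 matrix and the amplitude values are all made up, and it assumes that M is real and symmetric, as stated above.

```python
# Illustrative sketch only: names and numbers are hypothetical,
# not taken from the generated Madgraph code.

def quad_form_symmetric(m, a):
    """Compute sum_ij a[i]*m[i][j]*a[j] for a real symmetric m.

    The diagonal is added once and each upper-triangle term twice,
    i.e. ai*mij*aj + aj*mji*ai is rewritten as 2*ai*mij*aj.
    """
    n = len(a)
    total = 0.0
    for i in range(n):
        total += a[i] * m[i][i] * a[i]
        for j in range(i + 1, n):
            total += 2.0 * a[i] * m[i][j] * a[j]
    return total


def color_sum(m, jamp):
    """(A+iB) M (A-iB) evaluated as AMA + BMB, with A = Re(jamp), B = Im(jamp)."""
    re = [z.real for z in jamp]
    im = [z.imag for z in jamp]
    return quad_form_symmetric(m, re) + quad_form_symmetric(m, im)


def color_sum_reference(m, jamp):
    """Straightforward double loop over all (i, j) in complex arithmetic."""
    n = len(jamp)
    return sum(jamp[i].conjugate() * m[i][j] * jamp[j]
               for i in range(n) for j in range(n))


# Hypothetical symmetric color matrix and color amplitudes
color_matrix = [[16.0, -2.0, 4.0],
                [-2.0, 16.0, -2.0],
                [4.0, -2.0, 16.0]]
jamps = [1.0 + 2.0j, -0.5 + 0.25j, 3.0 - 1.0j]

fast = color_sum(color_matrix, jamps)
ref = color_sum_reference(color_matrix, jamps)
print(fast, ref)
```

The cross terms between A and B cancel because M is symmetric, so the reference result is real; the triangular loop visits n(n+1)/2 matrix elements instead of n*n.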
## WH
Not much to report, did some tests with nvc++ on a local machine.
Trying to use std::par for some calculations.
## JT
Compiling with SYCL for Intel GPUs.
Had some issues finding the right flags for the right device.
NN: can help you with that, also had very similar issues.
## TC
Working with NN on the Kokkos bridge.
## NN
Finished integration with madevent for SYCL, tested our five processes.
Compared the results with Fortran, using AV's scripts.
Got the same cross sections and LHE files from AV's tmad scripts.
Will also have a look at the performance numbers.
WH: can we produce gridpacks? one colleague at Argonne is doing GTF for this, can ask her
AV: not yet, before we do that we should complete the random color and random helicity,
but then we also need a lot of cleanup for the fortran code generation
When we have a meeting with Josh we should let you know
## CV
Will have some time in the coming weeks to get some performance numbers,
plan to run on CUDA and alpaka, on T4s.
## SR
SR: Worked on color matrices at the hackathon, testing both cuBLAS and CUTLASS.
The advantage of cuBLAS is that it pulls in no dependencies.
Focusing on the "batched" version of cuBLAS.
SR: will know next week about the CIPEA proposal
# Hackathon slides SH
Note the two flamegraphs before/after the changes by OM/ZW on MLM
Note also the roofline plots, we are doing very well.
The new plot also shows the appearance of single precision in the roofline after switching the color algebra to single precision. (Clarified with one mentor that the increase in arithmetic intensity is correct.)
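
One possible way to see why the arithmetic intensity goes up, as a back-of-the-envelope sketch only: arithmetic intensity is floating-point operations divided by bytes moved, so doing the same operations on 4-byte floats instead of 8-byte doubles doubles it. The matrix size and the operation count below are hypothetical (a simple matrix-vector product model), not numbers from the slides or the profiler.

```python
# Back-of-the-envelope model; all sizes are hypothetical, not measured values.

def arithmetic_intensity(flops, n_values_moved, bytes_per_value):
    """Floating-point operations per byte of memory traffic."""
    return flops / (n_values_moved * bytes_per_value)


n = 24                   # hypothetical number of color amplitudes
flops = 2 * n * n        # one multiply and one add per matrix element
values = n * n + 2 * n   # matrix plus input and output vectors

ai_double = arithmetic_intensity(flops, values, 8)  # 8 bytes per double
ai_single = arithmetic_intensity(flops, values, 4)  # 4 bytes per float
print(ai_double, ai_single)
```

In this simplified model the single-precision intensity is exactly twice the double-precision one; the real kernels move other data as well, so the measured shift may differ.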
WH: how would you optimize the GPU grid size in production?
AV/SR/SH... discussion, many options possible.
AV: anyway if many processes share a GPU, they use it sequentially IIUC
NN: yes confirmed, they go to different streams
# AOB
SR: next meeting? Monday 17 3pm there is a talk at CERN by the DG
Let's try 5pm on Monday 17.