Madgraph5 GPU development
MG dev meeting Mon 08.08.2022
https://indico.cern.ch/event/1185543/
Present: SR, SH, AV, Zenny Wetterstein, Jorgen Teig, TC, OM, CV, WH
Excused: NN
# Introductions
## Zenny
Just started as a doctoral student with Vienna for 3 years.
Main focus will be NLO calculations.
## Jorgen
Technical student for 9 months from Norway.
## Taylor
ATLAS physicist from Argonne; has mainly worked on generator software for the last 10 years.
(Also working on ML/AI workflows, e.g. TensorFlow, not only in HEP but also e.g. fusion/plasma, climate).
Kokkos port of MG.
Nathan (postdoc) is mainly doing the SYCL port on an Argonne experimental HPC using Ponte Vecchio chips.
SR: discussed with Maria Girone, we should get Intel GPUs end 2022 or early 2023.
TC: note CERN also has access to Intel devcloud.
## Olivier
Research scientist in Louvain, theory then phenomenology. Author and main maintainer of MG.
## Carl
CMS physicist, sysadmin at Wisconsin. Mainly doing computing now.
Testing GPU builds of MG; big interest in GPUs at Wisconsin.
It is difficult to get the right SYCL compiler to use with MG.
SR: Jorgen also working on Intel builds.
## Walter
ATLAS member at Argonne for the last 10 years.
Focus is 50% analysis (e.g. SUSY) and 50% computing (MG but also AI, simulation...).
# Round table
## AV
Gave the ICHEP talk.
The ACAT abstract has been accepted.
## OM
Progress on vectorization in madevent.
Changed the Fortran code of the Bridge (not the C++);
essentially changed the place where the color is generated
(there are two extra things we pass per event: leading color and helicity).
As input, the Bridge now takes two random numbers per event
and gives back one leading color and one helicity.
So now we need to implement these two algorithms in C++/CUDA and we are done!
In parallel, also had a look not only at SIMD/CPUs but also at GPUs.
One thing we need to check is the handling of many processes:
currently we throw a random number and decide which quarks are in the initial state, etc.
For vectorization I do this for 8 events, which is ok.
For GPUs I would like to do this for 32 events (the warp size).
AV: ah ok now I understand some previous discussion...
AV/SR: however the kernel launch will be very different!
OM: not necessarily, you can have a single kernel which inside
triggers different calculations for bunches of 32 events.
DISCUSSION, various options.
OM: could also do three different physics processes,
with different phase space integrations.
AV: maybe do this in standalone first?
OM: you would not see it, because standalone gives two different executables.
AV/OM: note each P1 subdirectory is a different executable
OM: the problem here is one single P1 subdirectory (one executable)
has several different matrix elements in the same executable
AV: try to define which physics process
OM: will do, certainly some EFT (where split u/d do different things),
but in the SM also some VBF (with EW physics).
AV: two other points for GPUs:
one is nb_page_max vs nb_page_loop,
two is profiling the scalar overhead.
OM: and three is that madevent uses a huge amount of RAM!
SR: let's focus on doing helicity and color in C++/CUDA.
## Taylor
Since ICHEP, no changes to the Kokkos code.
Walter has been working on moving those changes into the main branch.
Nathan did a quick survey of feature changes between the gold epochX4 and the current code;
most issues seem to come from memory management, which SYCL does differently anyway.
AV: maybe there is no need to redo exactly what is done in cudacpp;
try instead to respect the Bridge API in Fortran.
## Carl
nta
## WH
nta
## SH
ntr (was on holiday)
## SR
Followed up on the CIPEA project:
it looks like we could get a technical student to study the power consumption of CPUs vs GPUs.
Color matrix multiplication on tensor cores:
tried to use external libraries like catlas, but this has no double precision;
now trying to implement it by hand.
OM: maybe we do not need to do this in double precision!
SR: very good, because I am doing some comparisons of physics results.
Hackathon: we will know by the end of this week or next whether we are accepted.
# AOB
Next meeting?
OM will be on holiday on August 22 and the whole week.
SR is not available on the 29th.
OK, so meeting on the 22nd without OM.