Madgraph5 GPU development

Name: Madgraph5 GPU development
Start: 2024-06-25T16:30:00+02:00
End: 2024-06-25T17:30:00+02:00
Location: CERN

Tuesday 25 Jun 2024, 16:30 → 17:30 Europe/Zurich

513/R-070 - Openlab Space (CERN)

63816708295

Stefan Roiser

# Madgraph meeting 25 June 2024
https://indico.cern.ch/event/1355155/

Present: SR, DanieleMassaro, AV (minutes), AmeyaThete, OM

## Ameya

AT shows some slides following up from two weeks ago.
Some variations of ALOHA code result in some improvements, but relatively minor.

AT: as discussed last meeting, our kernels are already quite optimal, based on the roofline plots.

OM: the differences look small, it is difficult to understand how relevant they are.
AV: how reproducible are these numbers? (i.e. what are the fluctuations on these numbers?)
AT: the numbers are already the average of approximately 10 tests for each number
OM: can you give us an idea of the variance however?
AT: have not done that, but could do
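
A minimal sketch of the spread estimate OM asks about, assuming each quoted number is the mean of about 10 independent runs. The sample values and the names `summarize` and `difference_in_sigmas` are hypothetical, not taken from AT's slides.

```python
import math
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)  # n - 1 in the denominator
    sem = stdev / math.sqrt(len(samples))
    return mean, stdev, sem


def difference_in_sigmas(baseline, variant):
    """Difference of the two means in units of the combined standard error.

    Assumes the two sets of runs are independent.
    """
    mean_a, _, sem_a = summarize(baseline)
    mean_b, _, sem_b = summarize(variant)
    return (mean_b - mean_a) / math.hypot(sem_a, sem_b)


# Hypothetical throughputs (arbitrary units) for two code variants, 10 runs each.
baseline = [4.02, 3.98, 4.05, 3.97, 4.01, 4.03, 3.99, 4.00, 4.04, 3.96]
variant = [4.06, 4.01, 4.08, 4.00, 4.05, 4.07, 4.02, 4.03, 4.09, 3.99]

mean, stdev, sem = summarize(baseline)
print(f"baseline: {mean:.3f} +- {sem:.3f} (stdev {stdev:.3f})")
print(f"variant vs baseline: {difference_in_sigmas(baseline, variant):.2f} sigma")
```

Quoting the standard error next to each average would show directly whether a small difference between two variants is larger than the run-to-run fluctuations.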

AV: as you said, some of these metrics are difficult to interpret.
The variations between the different code implementations look quite tiny.
Could you maybe present a comparison of these metrics for different processes, e.g. from ggtt to ggttggg?
AT: that's possible, should not be too difficult to implement

SR: is this the full madevent or just the standalone?
AT: the full madevent
SR: my point is that if FFV1 is a tiny part of the workflow then maybe the change relative to FFV1 alone is bigger
(assuming that the numbers quoted here are for the full workflow)
AT: yes I can look into this
SR: Daniele and Stephan are doing some work to produce flamegraphs
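
SR's point about FFV1 can be made quantitative with Amdahl's law. The sketch below assumes that FFV1 takes a fraction `fraction` of the total runtime and that only FFV1 gets faster; the 5% and 10% figures are hypothetical, not measurements from the slides.

```python
def overall_speedup(fraction, local):
    """Amdahl's law: overall speedup when a part taking `fraction` of the
    runtime is sped up by a factor `local` and the rest is unchanged."""
    if not 0.0 < fraction <= 1.0 or local <= 0.0:
        raise ValueError("need 0 < fraction <= 1 and local > 0")
    return 1.0 / ((1.0 - fraction) + fraction / local)


def local_speedup(fraction, overall):
    """Invert Amdahl's law: speedup of the part itself, given the overall
    speedup measured on the full workflow."""
    if not 0.0 < fraction <= 1.0 or overall <= 0.0:
        raise ValueError("need 0 < fraction <= 1 and overall > 0")
    remaining = 1.0 / overall - (1.0 - fraction)
    if remaining <= 0.0:
        raise ValueError("overall speedup too large for this fraction")
    return fraction / remaining


# Hypothetical: FFV1 is 5% of the full workflow and becomes 10% faster.
total = overall_speedup(0.05, 1.10)
print(f"overall speedup: {total:.4f}")
print(f"recovered FFV1 speedup: {local_speedup(0.05, total):.2f}")
```

With these assumed inputs the change seen on the full workflow is well below 1%, even though FFV1 itself improves by 10%. The fraction of time spent in FFV1 would have to come from a profile such as the flamegraphs mentioned above.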

## Andrea

AV shows some slides.
Many issues piling up... anyway will try to test Olivier's fix and see if it fixes many of these issues

## Olivier

OM prepared an issue with a list of things to do before the release

OM: discussed tests with SR, to be run from "user interface"
OM working on some python framework to add more tests also to CI/CD
Most (all?) tests are comparing the new version of MG to the old version of MG
AV: very good that we have more tests! :-)

OM: bad news is that some of these tests are already failing...
One is using different vector sizes.
One worrying test shows that the cross section is different in c++ and fortran

OM: some issues may be related to what SR is working on for couplings
AV: can we have a reproducer?
OM: this is actually 826, ie the fact there is a zero cross section

AV: are your tests statistical or bit by bit?
OM: statistical only for the moment
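
A minimal sketch of the two kinds of comparison AV distinguishes, assuming each test yields a cross section with a statistical uncertainty and that the two samples are independent. The function names and the 3 sigma threshold are hypothetical choices, not what OM's tests use.

```python
import math
import struct


def bitwise_equal(a, b):
    """Bit-by-bit comparison of two doubles (stricter than a tolerance check)."""
    return struct.pack("<d", a) == struct.pack("<d", b)


def pull(xsec_a, err_a, xsec_b, err_b):
    """Difference of two cross sections in units of the combined uncertainty.

    Assumes independent samples; with identical random numbers in both runs
    the agreement is expected to be much tighter than this estimate.
    """
    combined = math.hypot(err_a, err_b)
    if combined == 0.0:
        raise ValueError("at least one uncertainty must be non-zero")
    return (xsec_a - xsec_b) / combined


def statistically_compatible(xsec_a, err_a, xsec_b, err_b, max_pull=3.0):
    """Statistical comparison: accept if |pull| is within `max_pull`."""
    return abs(pull(xsec_a, err_a, xsec_b, err_b)) <= max_pull


# Hypothetical numbers: two results that agree statistically but not bit by bit.
print(statistically_compatible(100.0, 1.0, 101.0, 1.0))
print(bitwise_equal(100.0, 101.0))
```

A statistical test tolerates different random number sequences or summation orders, while a bit-by-bit test only makes sense when both runs process exactly the same events.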

Discussion on tests. More tests is good, many ideas, brainstorming.

AV: one thing still to do is interface my 'tmad' tests comparing cross sections and lhe files into the CI.

OM: another issue is many tests of standard mg5amcnlo are now failing, spent some time fixing them.
Some issues in loop induced processes, where we do not use cudacpp yet. Progress but work to be done.

## Daniele

Working on a PR with CMS for untarring gridpacks.
This is https://github.com/mg5amcnlo/mg5amcnlo/pull/107

Also learnt how to run cudacpp version and started producing flamegraphs.
Idea is to understand what else we can bring to gpu, eg lhapdf.

DM reports on problems handling and operating the Madgraph4GPU Makefiles as a newcomer to the project (SR late addition)

SR: SH sent a message to sherpa people, mainly Max and Andy, asking to meet to discuss lhapdf
Not clear if we want to have a kokkos dependency, looking into cuda version
SR: The LHAPDF work is mainly in the nextgen context which in general goes beyond madgraph development (SR late modification)

DM shows some flamegraphs that he just produced
For ggttgg lhapdf is an important part, while the ME seems to have disappeared
The pdf (nnevolvepdf) seems to appear in three sections (two in dsig1_vec under pdg2pdf and rewgt, one in prepare_grouping_choice),
with another two similarly large sections for update_scale_coupling_vec and x_to_f_arg... all this in addition to the MEs.

SR: can you put these flamegraphs on a webpage?
DM: yes will try

DM to OM: if we bring more and more to the GPU, is there a way to understand the steps?
Would be nice to understand inputs and outputs... also to avoid copying data around. Do you have documents, papers etc?
OM: did some of this in the past for AV and SR, will have a look
AV: if you improve lhapdf then the two things that remain are update_scale_coupling_vec and x_to_f_arg:
the first is the computation of Gs, which we said we want to keep in fortran for now,
while x_to_f_arg seems like an excellent candidate to parallelize, well defined inputs and outputs, reentrant, no hidden state.
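
A small illustration of why the properties AV lists matter for parallelization. It does not reproduce x_to_f_arg: `map_point` is a hypothetical pure function standing in for any per-event mapping with explicit inputs and outputs and no hidden state.

```python
def map_point(x, lo, hi):
    """Hypothetical pure mapping of a random number x in [0, 1] to a
    log-spaced value in [lo, hi]. The result depends only on the arguments."""
    return lo * (hi / lo) ** x


def process_sequential(xs, lo, hi):
    """Reference: one event after the other."""
    return [map_point(x, lo, hi) for x in xs]


def process_in_chunks(xs, chunk_size, lo, hi):
    """Same computation split into independent chunks, as a GPU grid or a
    SIMD loop would do. No chunk needs anything from another chunk."""
    results = []
    for start in range(0, len(xs), chunk_size):
        chunk = xs[start:start + chunk_size]
        results.extend(map_point(x, lo, hi) for x in chunk)
    return results


xs = [i / 10 for i in range(11)]
print(process_in_chunks(xs, 4, 1.0, 100.0) == process_sequential(xs, 1.0, 100.0))
```

Because each result depends only on its own inputs, the events can be evaluated in any order or grouping and give the same answer, which is what makes such a function easy to move to a vectorized or GPU implementation.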

SR: LHAPDF is a major contributor to CPU usage, let's work on the GPU port of LHAPDF. Once finished we check the status and tackle the next major CPU contributor (SR late addition)

## Stefan

SR has some H100 numbers, but will postpone them to next time.

## AOB

Next meeting? In two weeks, will be at CERN on Mon-Tue 9-10 July.

AV: any interaction with the team doing FPGA on Madgraph?
Discussion: no, but could be interesting to invite them for a chat.

- 16:30 → 16:40
  
  News 10m
- 16:40 → 17:00
  
  Topical discussion 20m
- 17:00 → 17:20
  
  Round table 20m
  
  Speaker: Ameya Thete (University of Wisconsin-Madison (US))
  
  20240625-MG5Dev-H100.pdf
  
  TheteA100_Profiling.pdf
  
  valassi-20240625-MGonGPU-v001.pdf
  
  valassi-20240625-MGonGPU-v001.pptx
- 17:20 → 17:30
  
  AoB 10m