Madgraph5 GPU development
# Madgraph dev meeting Tue 09.07.2024
https://indico.cern.ch/event/1355156/
Present: SR, OM, AV (notes), ZW, DM
## H100 tests (SR ~20')
SR shows some slides
SR got an H100 through NextGen, shows specs, big machine
The H100 has double the memory (80 GB) of an A100 (40 GB)
And actually one machine has 8 H100s attached
OM: 64-thread warps?
SR: not sure
Slide 7, scan when changing the number of events in input_app.txt
AV: are you also changing the GPU grid size?
SR: no the GPU grid size is fixed here
SR: question for OM, does it make sense to increase the numbers in input_app.txt?
AV: how does this change with channelids?
e.g. if you have 1000 diagrams, is this now 1000 G jobs with 1 channel each?
OM: typically it will be 100 G jobs with 10 channels each
AV: is I/O affected?
Is it better to have many jobs (many files, each with few events),
or few jobs (few files, each with many events)?
OM: probably does not make much difference
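As a side note on the scan above: with a fixed GPU grid, requesting more events in input_app.txt mainly changes how many kernel launches are run, not the per-launch occupancy. A minimal illustrative sketch in C++ (not the actual madevent/cudacpp code; the grid sizes and event counts are made-up values):

```cpp
// Minimal illustrative sketch (not the actual madevent/cudacpp code):
// with a fixed GPU grid, requesting more events only changes how many
// kernel launches are needed, not the per-launch occupancy.
#include <cstdio>

int main()
{
  // Hypothetical values: the grid size is assumed fixed here,
  // independently of the number of events requested in input_app.txt
  const int gpublocks = 64;   // assumed fixed
  const int gputhreads = 256; // assumed fixed
  const int eventsPerLaunch = gpublocks * gputhreads;
  for( int nevents : { 8192, 81920, 819200 } ) // a scan like the one on slide 7
  {
    // number of kernel launches needed to produce nevents (rounded up)
    const int niterations = ( nevents + eventsPerLaunch - 1 ) / eventsPerLaunch;
    printf( "requested %7d events -> %3d launches of %d events each\n",
            nevents, niterations, eventsPerLaunch );
  }
  return 0;
}
```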
## Flamegraphs and lhapdf (DM ~20')
DM shows some slides
(AV: please also attach a pdf, in case the web site hosting changes)
DM compares flamegraphs of Madgraph with and without lhapdf.
In both cases, pdfs are used, but without lhapdf the internal pdf implementation of Madgraph is used.
The conclusion is that the internal pdf implementation in Madgraph is slower than lhapdf.
AV: then probably for future studies we should ignore the internal pdf and only use lhapdf
DM: yes this makes sense, this was the first time we did this study
SR: I guess the experiments use external lhapdf, so maybe we have overestimated the time spent in pdfs
(if that estimate was based on the internal pdf in Madgraph)
AV: can we also get improvements by improving HOW pdfs are used?
Sherpa got a factor 40; we cannot get that, but maybe some small improvements are possible
OM: they were doing something wrong, and using a badly implemented feature that we do not use
SR: proposes that DM continue the work (started with SH) to put the pdf on the GPU
DM: also proposes to profile Madgraph with adaptiveperf by Maks
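As background on the pdf-on-GPU idea: one possible approach (a sketch only, not necessarily what was started with SH; the PDF set, flavour and grid ranges below are assumptions for illustration) is to pre-tabulate x*f(x,Q) on the host with LHAPDF into a flat grid, which could then be copied to the device and interpolated per event in a kernel:

```cpp
// One possible approach (a sketch, not the actual implementation):
// pre-tabulate x*f(x,Q) on the host with LHAPDF, producing a flat grid
// that could later be copied to the GPU and interpolated in a kernel.
#include "LHAPDF/LHAPDF.h"
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
  // Hypothetical choices: PDF set, flavour (gluon) and grid ranges are illustrative
  LHAPDF::PDF* pdf = LHAPDF::mkPDF( "NNPDF31_nnlo_as_0118", 0 );
  const int nx = 64, nq = 32;
  std::vector<double> table( nx * nq );
  for( int ix = 0; ix < nx; ix++ )
  {
    const double x = std::pow( 10., -4. + 4. * ix / ( nx - 1 ) ); // 1e-4 .. 1
    for( int iq = 0; iq < nq; iq++ )
    {
      const double q = 10. + ( 1000. - 10. ) * iq / ( nq - 1 ); // 10 .. 1000 GeV
      table[ix * nq + iq] = pdf->xfxQ( 21, x, q ); // gluon x*f(x,Q)
    }
  }
  printf( "tabulated %zu gluon pdf values\n", table.size() );
  delete pdf;
  return 0;
}
```

The device-side interpolation is not shown; the point is only that a flat table is something that can be copied to the GPU and looked up for all events in parallel.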
## Zenny
Refactoring code for reweighting
## Olivier
Discusses the status. Three things are in progress:
- couplings
- warps
- gpucpp360
AV: let's not forget the segfaults and icolamp etc
OM: yes, but focus on the future
AV: this was two weeks of hard work, let's not take it for granted...
AV: let's discuss the branches
- AV: we agree that we aim to have everything in master? OM: yes
- OM: the warp stuff is in master_june24. AV: it should be gpucpp_june24, which it is not right now
AV: and gpucpp_warp? OM/SR: we can forget it
- AV: the third one? gpucpp_360? OM: yes, with no master
OM: but I need the june24/warp stuff before merging 360
[so the idea is: gpucpp is the main branch, june24 should go into it, and 360 depends on june24]
DM: so the idea is cudacpp as a plugin?
OM: yes
## Andrea
AV shows some slides
Clarification on branches: there should be a gpucpp_june24
Clarification on iconfig: are several channels already being tested, even in gpucpp?
To be discussed offline between AV and OM
## AOB
SR is looking at coupling ordering. He has some WIP where stop stop was crashing.
This is mainly in the python code of cudacpp and the mg5amcnlo codegen
AV: I never saw a stop stop crash, can you get a reproducer?
SR: there are three Gs, and it is crashing in one of them
AV: thanks, so it is a different iconfig
Next meeting: 23 July?
SR/AV/DM should be ok
ZW will be absent
OM: 24-25 would be easier than 23
SR: then 30 July... ok, that looks better