Madgraph5 GPU development

Europe/Zurich
Virtual (Zoom)

Zoom Meeting ID: 63368133283
Host: Stefan Roiser
# Monday 16.05.2022 Madgraph dev meeting

Present: SR, AV (notes), SH, OM, TC, CV, NN, WH

## Andrea's slides

Comment/OM: yes, working on the 340 branch now; this will only be on GitHub,
it is the branch we should give to the experiments

Comment/OM: there are actually three unweighting mechanisms.
The first is just a strategy not to write too many events to disk [unless you use the fixed grid]
(first you write all events passing cuts to /tmp, then from there you determine the maximum).
The second and third are hit-and-miss, depending on the maximum.
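The hit-and-miss idea mentioned here can be sketched as follows (a minimal illustration of the general Monte Carlo technique, not MadGraph code; names and weights are invented): an event with weight w is kept with probability w / w_max, so the surviving sample is unweighted.

```python
import random

def hit_and_miss_unweight(weighted_events, w_max, rng=random.Random(42)):
    """Hit-and-miss unweighting: keep each event with probability w / w_max.

    weighted_events: iterable of (event, weight) pairs.
    w_max: the maximum weight, e.g. determined in a first pass over the
           events written to /tmp, as described above.
    Returns the list of accepted (now unweighted) events.
    """
    accepted = []
    for event, w in weighted_events:
        if rng.random() < w / w_max:
            accepted.append(event)
    return accepted

# Toy usage: events labelled by an index, with made-up weights.
events = list(enumerate([0.1, 0.5, 1.0, 0.3, 0.9]))
w_max = max(w for _, w in events)
kept = hit_and_miss_unweight(events, w_max)
print(len(kept), "of", len(events), "events kept")
```

Note that an event carrying the maximum weight is always accepted, since `rng.random()` is strictly below 1.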

Q/TC: could the E-4 deviation come from rambo being massless?
A/OM/AV: no this is using madevent sampling

Q/TC: could the E-4 deviation be from cuda approximations?
A/AV: no, cpp and cuda are similar; they both deviate from fortran

## Taylor

With Nathan we got a GitLab CI up and running with all the latest architectures we use (AMD and Intel GPUs, etc.).

Also interacting with David about the alpaka implementation.

SR: presentation at ICHEP?
TC: WH will find out by the end of the week whether he was invited to give an ATLAS talk
(he will only go if he can give both an ATLAS talk and the madgraph talk)

## Nathan

Shows one slide with sycl/kokkos/cuda on various nvidia/amd/intel cpus/gpus.
Now the three are almost the same on A100, and sycl/kokkos are only ~5% slower than cuda on V100.

Q/AV: have you tried to normalise the Madgraph throughput to something like flops?
This would be interesting to compare for instance A100 to V100 or even to the Intel GPUs.
A/TC: have done something similar with the help of a hardware expert who knows the flops;
actually the normalized throughputs look quite flat (good)
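The normalisation discussed here can be sketched as a simple ratio (a hypothetical illustration: the peak FP32 figures are approximate public specs, and the throughput numbers are placeholders, not measured results from these minutes):

```python
# Normalise a raw throughput (events/s) by the GPU's peak FP32 FLOPS,
# so that A100, V100 and other devices can be compared on one axis.
PEAK_FP32_TFLOPS = {"V100": 15.7, "A100": 19.5}  # approximate spec values

def normalised(throughput_ev_per_s, gpu):
    """Events/s per peak TFLOP."""
    return throughput_ev_per_s / PEAK_FP32_TFLOPS[gpu]

measured = {"V100": 4.0e8, "A100": 5.0e8}  # placeholder events/s, NOT real data
for gpu, t in measured.items():
    print(gpu, f"{normalised(t, gpu):.3g} events/s per peak TFLOP")
```

If the normalised numbers come out roughly flat across devices, as TC reports, the code is using the different GPUs with comparable efficiency.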

Q/OM: for Skylake CPUs you could plot not only sycl and kokkos but also gcc/c++
(either Olivier's original standalone_cpp or Andrea's cpp without vectorization ie AVX=none)
A/NN: thanks will do

Q/OM: why is iris faster with kokkos while XE-HPSDV is faster with sycl?
A/NN: not clear yet

NN: also in contact with David about alpaka but still having some issues

Q/OM: why is sycl so much faster (2x? more?) than kokkos on skylake-8180?
A/NN: maybe something to do with vectorization? sycl uses it... and note that 8180 has two FMA units
Q/AV: maybe you can try to disable vectorization, it would be interesting
A/NN: yes I can try a 'novec' compiler flag for sycl and see if there is any performance difference

Q/SH: how did you put everything on the gitlab CI?
A/NN: this is our internal gitlab at argonne, we cloned the madgraph repo from github into our CI

## Carl

Made progress in compiling epochX with alpaka. Fixed dependencies on external packages.
Interacted with David and will prepare some updated documentation.

Suggestion: it would be useful to fix a target O/S and a target set of packages.

Could run some performance tests (we have some NVidia T4s), is there a standard protocol?
AV: maybe compare alpaka to the default cudacpp on T4? It's a very different machine, so a couple of data points are needed.

## Stephan

Interacted with Andrea about fixing the CI so that it fails if it lands on a node without a CUDA GPU.

## Olivier

Discussed a few things during Andrea's report.
Focusing on the move to 340.
Will also look at the color part, refactoring it so that the selection can be done on the GPU.

## Stefan

Had a discussion with Zenny Wettersten, he will work on a PhD with us.
Two main tasks so far: reweighting, and NLO on GPUs.
TC: we would be interested in the NLO discussion, please include me and NN too.

Myself, looking at the color matrix and tensor cores on the A100; it might make sense.
NN: there was some study of tensor cores at Argonne, could send you some docs.
OM: we have a student looking at simplifying color computations.

SR: the max size in tensor cores is 8x8
AV: the color algebra is a quadratic form; one should chop it up into 8-vector times 8x8-matrix times 8-vector pieces and iterate
OM: advantage?
SR: FMA units (optimised for matrix multiplications)
AV: probably also separate hardware
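The blocking idea AV describes can be sketched numerically (a hedged illustration of tiling a quadratic form into 8x8 pieces; plain NumPy here, the actual tensor-core mapping is not shown and the matrix is random, not a real color matrix):

```python
import numpy as np

def blocked_quadratic_form(M, v, block=8):
    """Evaluate v^T M v as a sum over block x block tiles.

    Each tile contribution is an 8-vector * 8x8-matrix * 8-vector
    product, i.e. the small matrix shape a tensor core could handle.
    Assumes len(v) is a multiple of `block`.
    """
    n = len(v)
    total = 0.0
    for i in range(0, n, block):
        for j in range(0, n, block):
            # Contribution of tile (i, j): v_i^T M_ij v_j
            total += v[i:i + block] @ M[i:i + block, j:j + block] @ v[j:j + block]
    return total

rng = np.random.default_rng(0)
n = 24                         # e.g. 24 color flows -> three 8-blocks per axis
M = rng.standard_normal((n, n))
M = 0.5 * (M + M.T)            # quadratic-form matrix taken symmetric
v = rng.standard_normal(n)
assert np.isclose(blocked_quadratic_form(M, v), v @ M @ v)
```

The tiled sum is exactly equal to the full quadratic form; the point of the decomposition is only to expose 8x8 matrix products to the hardware.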

## AOB

Next meeting: Tue 31 May 3pm (Mon 30 is holiday in the US)
