



## Madgraph5\_aMC@NLO on GPUs and vector CPUs: towards production

## The 5-year journey to the first LO release CUDACPP v1.00.00

#### Andrea Valassi (CERN)

on behalf of the MG5AMC CUDACPP development team

CHEP2024, Krakow, 23<sup>rd</sup> October 2024 https://indico.cern.ch/event/1338689/contributions/6015964



## The collaborating teams (summer 2024)

CUDACPP plugin core development (NVidia and AMD GPUs, vectorized C++ on CPUs)



#### Focus of this CHEP2024 talk: <u>first release of the CUDACPP plugin</u>

- will give only minimal details about the parallel work on the SYCL plugin
- will show work done for CMS but see Jin's poster for details on the work in CMS!

SYCL plugin (also Intel GPUs) WIP: integration into CUDACPP



CMS integration tests See Jin's poster for details!

Jin Choi (Seoul National University) Saptaparna Bhattacharya (Wayne State University) Robert Schoefbeck (HEPHY Vienna)



5.



#### Motivation and overview



#### Event generators: the first step in the HEP simulation chain



#### Event generators (1): why accelerate them?



5

#### Sequential processing vs. Data-parallel processing



Note: <u>task parallelism</u> (multi-threading, multi-processing) <u>differs from data parallelism</u>: it exploits a different dimension of hardware parallelism (many CPU cores, many nodes...)

5....

In our work on MG5AMC "CUDACPP" we have targeted data parallelism on <u>both</u> vector CPUs and GPUs from the very beginning!



## Event generators (2): why CPU vectorization and GPUs?

#### Vector CPUs and GPUs are widely available to HEP now...

- All of the CPUs in our computing Grid have SIMD (most have at least AVX2)
- GPUs are becoming more and more available to us especially at HPC centers

#### • ... but they are generally very difficult to exploit in most HEP software 🛞

- Example: Monte Carlo detector simulation has a lot of stochastic branching (makes lockstep processing difficult)

#### However: matrix element event generators are ideal software workflows for SIMD and GPUs!

- Monte Carlo sampling of many data points -> Data parallelism with near-perfect lockstep processing!



See also Andrea Sciabà's plenary talk on Thursday

A. Valassi – Madgraph on GPUs and vector CPUs: towards production

5.

JÜLICH



## What is a MC ME generator? A simplified computational anatomy



(FOR LATER!) Physics output: cross-section and LHE event file

A. Valassi - Madgraph on GPUs and vector CPUs: towards production

5





#### In a nutshell: we speed up the "matrix element" (ME) calculation

- In the old Fortran implementation, the ME calculation was the bottleneck
- Using GPUs/SIMD we speed up MEs so much that previously unimportant components become the bottleneck!
  - Phase space sampling, pdf's, ...

5....

- As predicted by Amdahl's law (later ... )





#### Architecture overview





#### OLD madevent APPLICATION SINGLE-EVENT MEs (< 2020 or < MG 3.6.0)



#### MG5AMC: from single-event to multi-event APIs

## Do's and dont's - two simple lessons learnt for any MC generator

(1) Design computational units using <u>re-entrant functions with well-defined inputs and outputs</u>!

 Beware of hidden inputs and outputs from common blocks and static data...



IMO, within MG5AMC this remains an important issue that complicates the porting to GPU/SIMD of non-ME components like phase space sampling...

- (2) Keep data parallelism in mind from the start: move from single-event APIs to multi-event APIs!
  - -Well-defined input array of many events, well-defined output array of many events



An additional technicality: prefer Structure-of-Array (SOA) memory layouts for the inputs and outputs! [Strictly needed only internally for SIMD and useful for GPUs, but good to have also in the API of the function]

#### If you design a new Monte Carlo from scratch, these are MUST's, not SHOULD's!

**5**...,



#### From single-event to multi-event APIs: some specific examples

• 1. MG5AMC at LO: the work described in this talk!

- This was the work necessary on the madevent Fortran framework (to interface to the CUDACPP "bridge")

- 2. MG5AMC at NLO: the ongoing work described in the next talk by Zenny!
  - The general idea (and possibly the interface of the CUDACPP "bridge") remains the same at NLO as at LO
- 3. **POWHEG + MG5AMC**: the work we plan to collaborate with!
  - This is the work the POWHEG team would need to do on their framework (to interface to the MG5AMC CUDACPP "bridge")





13/30

#### MG5AMC: CUDA/C++, Fortran, bash, python...

Initially (2020-2022) we focused on individual applications

In the last two years (2023-2024) we focused more and more on the full workflow orchestrating many applications - testing/optimizing the sharing of work in many processes - integrating the full user workflow including installation **1. ONE STANDALONE** 2. ONE madevent TOY APPLICATION **APPLICATION** (2020 - 2021)(2022)SERIAL DATA (LOOP) PARALLEL FORTRAN: CUDA / C++: **cuRAND** RANMAR **RANDOM NUMBERS RANDOM NUMBERS** DATA PARALLEL FORTRAN: CUDA / C++: MADEVENT RAMBO MOMENTA MOMENTA DATA DATA PARALLEL PARALLEL CUDA / C++: CUDA / C++: MEKERNELS MEKERNELS MATRIX ELEMENTS MATRIX ELEMENTS 



3. MANY madevent

**APPLICATIONs** 



CHEP2024, Krakow, 23 Oct 2024

14/30

A. Valassi - Madgraph on GPUs and vector CPUs: towards production

#### Test driven development

\*\*\* (2-512z) Compare MADEVENT\_CPP x1 xsec to MADEVENT\_FORTRAN xsec \*\*\*

OK! xsec from fortran (47.138611968034162) and cpp (47.138611968034169) differ by less than 3E-14 (2.220446049250313e-16)

\*\*\* (2-512z) Compare MADEVENT\_CPP x1 events.lhe to MADEVENT\_FORTRAN events.lhe reference (including colors and helicities) \*\*\*

OK! events.lhe.cpp.1 and events.lhe.ref.1 are identical

- I personally think that writing tests is as important as (more important than?) writing implementation code!
- At each stage of development we have been adding new tests and we still run them (manually and/or in the CI)
  - One standalone application: use hardcoded random seeds, compare momenta and MEs to reference files (googletest)
  - One madevent application: use same random seeds, compare cross sections and LHE files for Fortran/C++/CUDA MEs
    - Require ~bit-by-bit equal results (within numerical precision), this is much more than statistical comparisons!
    - This test has been essential for identifying and later fixing a large number of important bugs
  - New (2024): the two tests above are now in the CI for many physics processes, including automatic code generation in the CI
  - Under development: full workflow with many madevent applications, compare overall cross sections and LHE files as above



Test a large phase space of development environments!

- Different physics processes
- Different vectorization scenarios
- Different floating point precisions
- Different compilers and O/S

- ...

Ideally, we will try to port all these ad-hoc manual tests to the CI

15/30

5

#### Some results and new developments





#### Amdahl's law – not a theoretical possibility, we see it all the time!

- The matrix element (ME) calculation was the bottleneck >95% for many processes in Fortran Madgraph -But non-ME part <5% HAS become the bottleneck after we managed to accelerate MEs by factors O(10-1000)!
- Amdahl's law: if the parallelizable part takes a fraction of time p, the maximum speedup is 1/(1-p)

- If the non-ME part takes 5%, the maximum speedup is limited to x20 even when the ME speedup is infinite!



https://en.wikipedia.org/wiki/Amdahl%27s\_law

## Results for $gg \rightarrow t\bar{t}ggg$ from CUDA on NVidia V100 GPUs

|          |                              |           | madevent                                  |                              |               | standalone      |
|----------|------------------------------|-----------|-------------------------------------------|------------------------------|---------------|-----------------|
|          | CUDA grid size               |           |                                           | 16384                        |               |                 |
|          | $gg \rightarrow t\bar{t}ggg$ | MEs       | $t_{\rm TOT} = t_{\rm Mad} + t_{\rm MEs}$ | $N_{\rm events}/t_{\rm TOT}$ | Nevents       | $s/t_{\rm MEs}$ |
|          | 88-11888                     | precision | [sec]                                     | [events/sec]                 | [MEs/sec]     |                 |
| <b>–</b> | Fortran                      | double    | 998.1 = 4.4 + 993.7                       | 8.21E1 (=1.0)                | 8.24E1 (=1.0) |                 |
| P2024    | CUDA/GPU                     | double    | 16.8 = 5.9 + 10.9                         | 4.88E3 (x60)                 | 7.54E3 (x92)  | 9.54E3 (x115)   |
| CHEI     | CUDA/GPU                     | mixed     | 14.3 = 5.7 + 8.6                          | 5.72E3 (x70)                 | 9.49E3 (x115) | 1.16E4 (x141)   |
| AV –     | CUDA/GPU                     | float     | 10.7 = 5.4 + 5.3                          | 7.65E3 (x94)                 | 1.53E4 (x187) | 2.16E4 (x264)   |

2. ONE madevent APPLICATION (2022) FORTRANE RANMMAR RANDOM HUMBERS FORTRANE MADEVENT MADEVENT MADEVENT MADEVENT MADEVENT MADEVENT MADEVENT MADEVENT

Results refer to a single CPU core

Amdahl's law: Overall speedup < ME speedup

Overall speedup x60 (double) and x90 (float) with GPU MEs over scalar Fortran

**ME speedup x90 (double)** and x180 (float) with GPU MEs over scalar Fortran

diagram 1

QCD=5. QED=0

CHEP2024, Krakow, 23 Oct 2024

gg→tt̄ggg subprocess of pp→tt̄+3jets) 1240 Feynman diagrams 120x120 color matrix

18/30

This is a "complex" physics process

5

**Even after GPU acceleration, MEs remain the bottleneck** (11s out of 17s in double precision) Trying to further optimise the ME calculation is still useful to obtain further overall speedups Example: increase the GPU grid size (requires work on Fortran too), smaller kernels, etc...

## Results for $gg \rightarrow t\bar{t}ggg$ from vectorized C++ on Intel Gold CPUs



## Floating point precision $\rightarrow$ constraints on GPU hardware

- Previous slides: if we could use floats instead of doubles, our MEs would be a factor 2x faster!
  - -Our vectorized C++ is 2x faster on CPU (e.g. AVX512: a 512-bit register holds 16 floats but only 8 doubles)
  - -Our CUDA is 2x faster on V100's (on NVidia data-centre GPUs, the FP64 FLOPs are x1/2 the FP64 FLOPs)
- But we need double precision for Feynman diagrams (single precision gives numerical instabilities)
  - This means that we cannot use consumer-grade GPUs (on T4's, the FP64 FLOPs are x1/32 the FP32 FLOPs)
  - Also: GPUs for AI like Blackwell GB200 do have (a lot of!) FP64, but what you pay are FP4 tensor core FLOPs!
    - (En passant: in CUDACPP we do NOT use tensor cores at all these require a different software API than CUDA cores)

| Architecture                                 | ure Blackwell Hopper A |                | Ampere         | Volta          |  |
|----------------------------------------------|------------------------|----------------|----------------|----------------|--|
| Year                                         | 2024                   | 2024 2022      |                | 2017           |  |
| GPU Name                                     | NVIDIA GB200           | NVIDIA H100    | NVIDIA A100    | NVIDIA V100    |  |
| FP64                                         | 90 teraFLOPS           | 34 teraFLOPS   | 9.7 teraFLOPS  | 7.8 teraFLOPs  |  |
| FP32                                         | 180 teraFLOPS          | 67 teraFLOPS   | 19.5 teraFLOPS | 15.7 teraFLOPs |  |
| BF16                                         | N/A                    | 134 teraFLOPS  | 39 teraFLOPS   | N/A            |  |
| t FP16                                       | N/A                    | 134 teraFLOPS  | 78 teraFLOPS   | N/A            |  |
| FP16<br>FP64 Tensor Core<br>TF32 Tensor Core | 90 teraFLOPS           | 67 teraFLOPS   | 19.5 teraFLOPS | N/A            |  |
| TF32 Tensor Core                             | 5 petaFLOPS            | 989 teraFLOPS  | 312 teraFLOPS  | N/A            |  |
| FP16 Tensor Core                             | 10 petaFLOPS           | 1979 teraFLOPS | 624 teraFLOPS  | 125 teraFLOPs  |  |
| D BF16 Tensor Core                           | 10 petaFLOPS           | 1979 teraFLOPS | 624 teraFLOPS  | N/A            |  |
| FP8 Tensor Core                              | 20 petaFLOPS           | 3958 teraFLOPS | N/A            | N/A            |  |
| FP4 Tensor Core                              | 40 petaFLOPS           | N/A            | N/A            | N/A            |  |

https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper https://resources.nvidia.com/en-us-blackwell-architecture

5.



We did switch to floats where possible – "mixed-precision": double for Feynman, float for color matrix

 This is the default in CUDAPP v1.00.00 (even if the speedup over double is limited and still to be improved)
 We also had a closer look at the source of numerical instabilities with the CADNA tool



## **Beyond NVidia GPUs**



- The CUDACPP plugin uses a single source-code approach for CPUs (C++) and NVidia GPUs (CUDA), based on #ifdef's
   The *few CUDA calls* are encapsulated by design in GPU classes
  - We do not use any vendor-specific features (e.g. Streams) yet
- CUDACPP v1.00.00 includes support for AMD GPUs through HIP, using the same #ifdef approach
  - This was inspired by the LHCb approach to GPUs
  - NVidia and AMD provide ~80% of the GPU power in top500 HPCs
- Our Argonne colleagues are working on extending this to Intel GPUs via SYCL (based on their earlier work on a SYCL plugin)

| Accelerated Supers,        | Peak    |            | Total     |             |            |       |
|----------------------------|---------|------------|-----------|-------------|------------|-------|
| <u>By Accelerator Type</u> | Systems | Share      | Teraflops | Share       | Cores      | Share |
| AMD MI200                  | 11      | 5.7%       | 2,499,680 | 26.5%       | 12,757,568 | 25.3% |
| AMD MI300                  | 3       | 1.6%       | 96,294    | 1.0%        | 387,072    | 0.8%  |
| Intel GPU Max              | 4       | 2.1%       | 2,067,806 | 22.0%       | 9,613,760  | 19.0% |
| Nvidia Al00                | 83      | 43.0%      | 1,462,714 | 15.5%       | 9,754,044  | 19.3% |
| Nvidia H100                | 29      | 15.0%      | 2,379,348 | 25.3%       | 6,599,000  | 13.1% |
| Nvidia Other               | 60      | 31.1%      | 899,135   | 9.5%        | 10,994,436 | 21.8% |
| Other                      | 3       | 1.6%       | 12,870    | 0.1%        | 360,120    | 0.7%  |
| Total                      | 193     |            | 9,417,847 |             | 50,466,000 |       |
| All Supers 500             |         | 12,499,181 |           | 114,650,780 |            |       |
| Accelerator Share          | 38.6%   |            | 75.3%     |             | 44.0%      |       |
| CPU-Only Share             | 61.4%   |            | 24.7%     |             | 56.0%      |       |

5

https://www.nextplatform.com (13 May 2024)

GPU Teraflops by GPU family (HPCs in June 2024 Top500)
NVidia GPUs 50.3%
AMD GPUs 27.5%

- Intel GPUs 22.0%



A. Valassi – Madgraph on GPUs and vector CPUs: towards production



#### A. Valassi – Madgraph on GPUs and vector CPUs: towards production

float

Beyond NVidia GPUs: results with AMD GPUs at LUMI

|                             |           |                                           | madevent                     |                              |
|-----------------------------|-----------|-------------------------------------------|------------------------------|------------------------------|
| $gg \rightarrow t\bar{t}gg$ | MEs       | $t_{\rm TOT} = t_{\rm Mad} + t_{\rm MEs}$ | $N_{\rm events}/t_{\rm TOT}$ | $N_{\rm events}/t_{\rm MEs}$ |
| $gg \rightarrow hgg$        | precision | [sec]                                     | [events/sec]                 | [MEs/sec]                    |
| Fortran(scalar)             | double    | 26.6 = 1.4 + 25.2                         | 3.09E3 (=1.0)                | 3.25E3 (=1.0)                |
| C++/none(scalar)            | double    | 33.2 = 1.4 + 31.2                         | 2.47E3 (x0.8)                | 2.58E3 (x0.8)                |
| C++/sse4(128-bit)           | double    | 16.9 = 1.4 + 15.5                         | 4.85E3 (x1.6)                | 5.28E3 (x1.6)                |
| C++/avx2(256-bit)           | double    | 8.1 = 1.4 + 6.7                           | 1.01E4 (x3.3)                | 1.22E4 (x3.8)                |
| HIP/GPU                     | double    | 2.9 = 1.8 + 1.1                           | 2.88E4 (x9.3)                | 7.69E4 (x24)                 |
| C++/none(scalar)            | mixed     | 33.2 = 1.4 + 31.8                         | 2.47E3 (x0.8)                | 2.57E3 (x0.8)                |
| C++/sse4(128-bit)           | mixed     | 16.7 = 1.4 + 15.3                         | 4.91E3 (x1.6)                | 5.36E3 (x1.6)                |
| C++/avx2(256-bit)           | mixed     | 8.3 = 1.4 + 6.9                           | 9.93E3 (x3.2)                | 1.20E4 (x3.7)                |
| HIP/GPU                     | mixed     | 2.9 = 1.8 + 1.1                           | 2.88E4 (x9.3)                | 7.69E4 (x24)                 |
| C++/none(scalar)            | float     | 32.1 = 1.4 + 30.8                         | 2.55E3 (x0.8)                | 2.66E3 (x0.8)                |
| C++/sse4(128-bit)           | float     | 9.2 = 1.4 + 7.8                           | 8.92E3 (x2.9)                | 1.05E4 (x3.2)                |
| C++/avx2(256-bit)           | float     | 4.9 = 1.4 + 3.5                           | 1.69E4 (x5.5)                | 2.37E4 (x7.3)                |

2.4 = 1.8 + 0.7

gg→tt̄gg (subprocess of pp→tt̄+2jets) 123 Feynman diagrams 24x24 color matrix



Overall speedups for  $gg \rightarrow t\bar{t}gg$ 

- x9.3 for MEs on an AMD Instinct MI200 GPU

- x3.2 for MEs on an AMD 7A53 CPU with AVX2 (Amdahl: maximum overall speedup is x20)

One limitation: was unable to build HIP code for the more complex  $gg \rightarrow t\bar{t}ggg$  process...

We kindly acknowledge the use of LUMI HPC resources under project 465001114 ("CERN / HEPiX Benchmarking GPU WP" EHPC-BEN-2024B04-053) to produce these results

1.20E5 (x37)

3.36E4 (x11)

CHEP2024, Krakow, 23 Oct 2024



22/30



(**5**) A.

CHEP2024

≥

HIP/GPU

## Beyond Standard Model physics in MG5aMC CUDACPP



- CUDACPP v1.00.00 includes support for several BSM processes (at LO): SUSY, HEFT, SMEFT
- Motivation: speed up large productions of BSM processes at LO (for many SM processes NLO is required)
   Need event samples exploring a large parameter space (by event generation or by event reweighting)
- Technical challenge (with respect to SM): non-standard parameters and couplings
  - debugged/fixed non-standard code to propagate the running of the QCD coupling  $\alpha_s$  to these BSM parameters and couplings
  - (reminder: for each event,  $\alpha_s$  scale is computed in Fortran and passed to cuda/c++ that computes  $\alpha_s$ -dependent parameters)





## Studies with CMS: understanding Drell-Yan+3jets speedups (1)

- CMS have been the first early adopters of CUDACPP an extremely useful, mutually beneficial, collaboration!
   See the details of all the studies performed by/within CMS in Jin Choi's poster
- One of many issues we discussed with CMS: what is the speedup we can achieve from cudacpp in DY+jets?

 $g\bar{u} \rightarrow \tau^+ \tau^- gg\bar{u}$ (subprocess of DY+3j) 100 Feynman diagrams 6x6 color matrix



5.

|                                                  |           |                                           | madevent                     |                              |
|--------------------------------------------------|-----------|-------------------------------------------|------------------------------|------------------------------|
| $g\bar{u} \rightarrow \tau^+ \tau^- g g \bar{u}$ | MEs       | $t_{\rm TOT} = t_{\rm Mad} + t_{\rm MEs}$ | $N_{\rm events}/t_{\rm TOT}$ | $N_{\rm events}/t_{\rm MEs}$ |
| (81920 weighted events)                          | precision | [sec]                                     | [events/sec]                 | [MEs/sec]                    |
| Fortran(scalar)                                  | double    | 52.2 = 17.0 + 35.2                        | 1.57E3 (=1.0)                | 2.32E3 (=1.0)                |
| C++/none(scalar)                                 | double    | 50.9 = 17.0 + 33.9                        | 1.61E3 (x1.0)                | 2.42E3 (x1.0)                |
| C++/sse4(128-bit)                                | double    | 35.5 = 17.0 + 18.5                        | 2.31E3 (x1.5)                | 4.44E3 (x1.9)                |
| C++/avx2(256-bit)                                | double    | 24.5 = 16.9 + 7.6                         | 3.34E3 (x2.1)                | 1.08E4 (x4.7)                |
| C++/512y(256-bit)                                | double    | 23.9 = 16.9 + 7.0                         | 3.43E3 (x2.2)                | 1.17E4 (x5.0)                |
| C++/512z(512-bit)                                | double    | 26.7 = 17.0 + 9.6                         | 3.09E3 (x2.0)                | 8.57E3 (x3.7)                |
| CUDA/GPU                                         | double    | 17.6 = 17.4 + 0.3                         | 4.65E3 (x3.0)                | 3.26E5 (x140)                |
| C++/none(scalar)                                 | mixed     | 50.9 = 16.9 + 33.9                        | 1.61E3 (x1.0)                | 2.41E3 (x1.0)                |
| C++/sse4(128-bit)                                | mixed     | 33.9 = 16.9 + 17.0                        | 2.41E3 (x1.5)                | 4.82E3 (x2.1)                |
| C++/avx2(256-bit)                                | mixed     | 24.8 = 17.2 + 7.6                         | 3.31E3 (x2.1)                | 1.08E4 (x4.7)                |
| C++/512y(256-bit)                                | mixed     | 24.1 = 17.1 + 7.0                         | 3.40E3 (x2.2)                | 1.18E4 (x5.0)                |
| C++/512z(512-bit)                                | mixed     | 26.5 = 17.0 + 9.6                         | 3.09E3 (x2.0)                | 8.57E3 (x3.7)                |
| CUDA/GPU                                         | mixed     | 17.7 = 17.4 + 0.3                         | 4.64E3 (x3.0)                | 3.23E5 (x138)                |
| C++/none(scalar)                                 | float     | 50.1 = 16.9 + 33.1                        | 1.64E3 (x1.0)                | 2.47E3 (x1.1)                |
| C++/sse4(128-bit)                                | float     | 26.3 = 16.9 + 9.4                         | 3.11E3 (x2.0)                | 8.70E3 (x3.7)                |
| C++/avx2(256-bit)                                | float     | 20.8 = 16.9 + 3.9                         | 3.94E3 (x2.5)                | 2.12E4 (x9.1)                |
| C++/512y(256-bit)                                | float     | 20.6 = 16.9 + 3.6                         | 3.99E3 (x2.5)                | 2.27E4 (x9.7)                |
| C++/512z(512-bit)                                | float     | 21.7 = 16.9 + 4.8                         | 3.78E3 (x2.4)                | 1.71E4 (x7.3)                |
| CUDA/GPU                                         | float     | 17.6 = 17.4 + 0.2                         | 4.66E3 (x3.0)                | 4.46E5 (x191                 |

For one typical subprocess of DY+3jets: Fortran MEs ~ 67% of the total time => Max overall speedup is x3 (Amdahl)

Achieved speedups (mixed FP precision): - x3.0 on GPU (NVidia V100)

- x2.2 on SIMD ("512y" on Intel Silver)





24/30

⋛

## Studies with CMS: understanding Drell-Yan+3jets speedups (2)

- Non-ME is 33% of overall Fortran time (max speedup x3): what exactly takes time? => Profiling!
  - The answer is: Fortran phase space sampling is now the bottleneck for DY+3j (93s out of 177s overall with C++ MEs)

| pp_dy3j.mad//fortran/output.txt                         | Overally                      | pp_dy3j.mad//cpp512z | /output.txt              |             |
|---------------------------------------------------------|-------------------------------|----------------------|--------------------------|-------------|
| [GridPackCmd.launch] GRIDPCK TOTAL 447.7169 seconds     | Overall:                      | [GridPackCmd.launch] | GRIDPCK TOTAL 176.889    | 1 seconds   |
| [madevent COUNTERS] PROGRAM TOTAL 443.48                | - Fortran 447s                | [madevent COUNTERS]  | PROGRAM TOTAL 172.637    |             |
| [madevent COUNTERS] Fortran Other 6.5439                | - C++ 177s (speedup x2.5)     | [madevent COUNTERS]  | Fortran Other 6.5768     |             |
| [madevent COUNTERS] Fortran Initialise(I/O) 4.4648      |                               | [madevent COUNTERS]  | Fortran Initialise(I/O)  | 4.486       |
| [madevent COUNTERS] Fortran Random2Momenta 93.2692      | Matrix elements:              | [madevent COUNTERS]  | Fortran Random2Momenta 9 | 3.2907      |
| [madevent COUNTERS] Fortran PDFs 8.2697                 | - Fortran 307s                | [madevent COUNTERS]  | Fortran PDFs 8.2998      |             |
| [madevent COUNTERS] Fortran UpdateScaleCouplings 7.3142 | - C++ 36s <i>(speedup x8)</i> | [madevent COUNTERS]  | Fortran UpdateScaleCoupl | ings 7.2827 |
| [madevent COUNTERS] Fortran Reweight 3.6975             |                               | [madevent COUNTERS]  | Fortran Reweight 3.7045  |             |
| [madevent COUNTERS] Fortran Unweight(LHE-I/O) 4.8636    | Non-ME (overall - ME):        | [madevent COUNTERS]  | Fortran Unweight(LHE-I/O | ) 4.8719    |
| [madevent COUNTERS] Fortran SamplePutPoint 8.3255       | - (Same for Fortran/C++)      | [madevent COUNTERS]  | Fortran SamplePutPoint 8 | .2892       |
| [madevent COUNTERS] Fortran MEs 306.731                 | - Total non-ME 140s           | [madevent COUNTERS]  | CudaCpp Initialise 0.361 | 9           |
| [madevent COUNTERS] OVERALL NON-MEs 136.748 FORTRAN     | - Phase space sampling 93s    | [madevent COUNTERS]  | CudaCpp Finalise 0.0221  |             |
| [madevent COUNTERS] OVERALL MEs 306.731                 |                               | [madevent COUNTERS]  | CudaCpp MEs 35.4557      | C++         |
| AV – preliminary                                        |                               | [madevent COUNTERS]  | OVERALL NON-MEs 137.181  | AVX512      |

AV – preliminary!

- Profiling above is via code instrumentation (complementary to perf/flamegraph sampling profiling)
  - Inserted low-overhead rdtsc counters in madevent internal fortran calls (still being tuned!)
  - Keep, parse, aggregate all madevent logs from many processes in the python/bash orchestrator
  - Not yet merged in mg5amcnlo or cudacpp but might be added in upcoming versions



25/30

[madevent COUNTERS] OVERALL MEs 35.4557

## Reweighting

5

1. Generate signal sample at  $\theta_{ref}$ , with  $w_i(\theta_{ref})=1$  (By definition, background does not depend on  $\theta$ )

2. Full detector simulation (MC truth event properties  $\mathbf{x}_{i}^{(true)} \rightarrow observed$  event properties  $\mathbf{x}_{i}$ )

3. Reweight each event by matrix element ratio

 $w_i(\theta) = \frac{\operatorname{Prob}_{(\theta)}(\mathbf{x}_i^{(\operatorname{true})})}{\operatorname{Prob}_{(\theta_{\operatorname{ref}})}(\mathbf{x}_i^{(\operatorname{true})})} = \frac{|\mathcal{M}(\theta, \mathbf{x}_i^{(\operatorname{true})})|^2}{|\mathcal{M}(\theta_{\operatorname{ref}}, \mathbf{x}_i^{(\operatorname{true})})|^2}$ 





- Two advantages: lower costs (no detector simulation), fewer statistical fluctuations – Interest in CMS for EFT studies (exploration of large space of model parameters)
- In practice for MG5AMC: read in LHE file, add weights, write back modified LHE file
  - Reweighting using the new ME engine in CUDA/C++ provides O(x10) speedups! <
  - Zenny's "tREX" is essentially ~ready to be included in an upcoming CUDACPP v1.01

One further possible application: weight derivatives in parameter measurements

- My suggestion: save weight derivatives (w.r.t measured parameter) in LHE files
- Use them to compute ideal measurement error or for weight derivative regression
- See https://doi.org/10.1051/epjconf/202024506038 and https://zenodo.org/records/11120823



0i

Zenny Wettersten

Number of jets

Зi

## A tale of two repositories

#### Our work of the last 5 years is mainly here!



• For more details:

- Our wiki: https://github.com/madgraph5/madgraph4gpu/wiki/Working-with-cudacpp-v1.00.00-(October-2024)
- Our bi-weekly development meetings: https://indico.cern.ch/category/12586



#### The long journey to the v1.00.00 CUDACPP release – some steps



#### Conclusions

**5**..,

- The first release of the MG5AMC CUDACPP plugin for LO processes has been delivered!
- We speed up the matrix element (ME) calculation via data parallelism with excellent lockstep processing
  - On NVidia and AMD GPUs in the order of x100 to x1000
  - On vector CPUs we achieve the theoretical speedup limit of x8 for AVX512 SIMD in double precision
- We achieved overall speedups between x3 and x70 (for DY+3j and gg→tt
  ggg) depending on the physics process
   – For many processes (like DY+3j) we speed up the ME so much that the bottleneck is the non-ME component
- Many opportunities for further speedups, especially in the non-ME components (phase space sampling, PDFs...)

   WIP to profile the largest residual bottlenecks and to identify speedup strategies (with/without data parallelism)
- We need double precision (FP64) in the majority of floating-point operations
  - Using single precision (FP32) would allow a further speedup of MEs by a factor ~x2 but leads to numerical instabilities
  - WARNING: in the latest GPUs targeting AI like NVidia Blackwell, what you pay for is mainly FP4/FP8/FP16 instead!
- Mutually beneficial collaboration with CMS to test our software in production-like environments – See the details in the poster by Jin Choi on Thursday
- The techniques and software used in the LO release were designed with NLO in mind and most can be reused!
  - See the details in the next talk by Zenny Wettersten
  - More generally, two lessons learnt for any MC: use reentrant functions with well-defined inputs/outputs, use multi-event APIs

29/30

#### **Thanks Krakow!**

Thanks to the organizers of this conference



Conference on Computing in High Energy and Nuclear Physics

Thanks Staszek and the whole KORALW/YFSWW team for everything I learnt from you about Monte Carlo's





CHEP2024, Krakow, 23 Oct 2024

30/30

# BACKUP SLIDES



CHEP2024, Krakow, 23 Oct 2024 31/30

#### Benchmarking GPUs (CUDA cores or tensor cores? Which FP precision?)

On data-centre NVidia GPUs, in CUDA cores the FP32 FLOPs are 2x the FP64 FLOPs

| Architecture     | Blackwell     | Hopper         | Ampere         | Volta          |
|------------------|---------------|----------------|----------------|----------------|
| Year             | 2024          | 2022           | 2020           | 2017           |
| GPU Name         | NVIDIA GB200  | NVIDIA H100    | NVIDIA A100    | NVIDIA V100    |
| FP64             | 90 teraFLOPS  | 34 teraFLOPS   | 9.7 teraFLOPS  | 7.8 teraFLOPs  |
| FP32             | 180 teraFLOPS | 67 teraFLOPS   | 19.5 teraFLOPS | 15.7 teraFLOPs |
| BF16             | N/A           | 134 teraFLOPS  | 39 teraFLOPS   | N/A            |
| FP16             | N/A           | 134 teraFLOPS  | 78 teraFLOPS   | N/A            |
| FP64 Tensor Core | 90 teraFLOPS  | 67 teraFLOPS   | 19.5 teraFLOPS | N/A            |
| TF32 Tensor Core | 5 petaFLOPS   | 989 teraFLOPS  | 312 teraFLOPS  | N/A            |
| FP16 Tensor Core | 10 petaFLOPS  | 1979 teraFLOPS | 624 teraFLOPS  | 125 teraFLOPs  |
| BF16 Tensor Core | 10 petaFLOPS  | 1979 teraFLOPS | 624 teraFLOPS  | N/A            |
| FP8 Tensor Core  | 20 petaFLOPS  | 3958 teraFLOPS | N/A            | N/A            |
| FP4 Tensor Core  | 40 petaFLOPS  | N/A            | N/A            | N/A            |

Blackwell GPUs do have many more FP64 FLOPs in CUDA cores than Hopper or Ampere or Volta...

#### ...but what you really pay for in Blackwell GPUs are the FP4/FP8/FP16 tensor core FLOPs for Al!

NB: CUDA cores and tensor cores are two different types of processors on the same chip! You must develop two different types of software! (In MG5AMC we do not use tensor cores yet...)

https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper https://resources.nvidia.com/en-us-blackwell-architecture

A. Valassi – Madgraph on GPUs and vector CPUs: towards production

- CHEP2024

¥



#### Speeding up PDF's





(5

CHEP2024, Krakow, 23 Oct 2024

33/30

#### Vectorizing phase space sampling?



I had a first quick look at possibly vectorizing sample\_get\_x

- these are relatively short functions with simple operations, it is not rocket science
  - API: could start by preparing baskets and then looping internally as Olivier did for MEs
- the main problem I see is that there are many COMMON's making this stateful
  - can the hidden inputs/outputs requiring these COMMON's be avoided?
  - or can these hidden inputs/outputs be moved outside the event/particle loop?



Andrea Valassi – CMS DY+jets, timers/profiling, first sampling improvements

20 August 2024

6



A. Valassi – Madgraph on GPUs and vector CPUs: towards production



#### Low hanging fruits in sample\_get\_x



# Low hanging fruits?

<u>#946</u> Status: WIP PR exists, to be rediscussed with Olivier

| 53s (down by 30%)                                                                                                                             | 946b. (issue 968)                                       |
|-----------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
| 135.1144         130.8140s         53.0338s         for       44652395         events         35.4908s       for         1769472       events | "More controversial" changes?<br>Save an additional 30% |
| 95.3232s<br>35.4908s for 1769472 events                                                                                                       |                                                         |
| luding trivial improvements #969<br>78s (down by 15%)                                                                                         | 946a. (issue 969)                                       |
| 160.1718                                                                                                                                      | "Non-controversial" changes?                            |
| 155.8605s<br>78.1023s for 44652395 events                                                                                                     | Ŭ                                                       |
| 70.10255 IOL 44052575 events                                                                                                                  | Save 15%                                                |
| 35.4320s for 1769472 events                                                                                                                   |                                                         |
| 120.4290s<br>35.4320s for 1769472 events                                                                                                      |                                                         |
| improvements<br>176.8891                                                                                                                      |                                                         |
| <u>172.6370s</u>                                                                                                                              |                                                         |
| 93.2907s for 44651014 events                                                                                                                  |                                                         |
| 35.4557s for 1769472 events<br>137.1806s                                                                                                      |                                                         |
| 35.4557s for 1769472 events                                                                                                                   |                                                         |
| m/valassi/madgraph4gpu/commit/348664                                                                                                          | c66d90f47d1d9e6fd72d7dd7f4b0fa7cff                      |
|                                                                                                                                               |                                                         |

A. Valassi – status of PRs (plus CMS/DY, timers/profiling, sampling...)

https://github.com/valassi/m

27 August 2024 18



< ..

< [madevent COUNTERS]

< [madevent COUNTERS]

[madevent COUNTERS] OVERALL MES

CudaCpp MEs

OVERALL NON-MES



#### Issue #2 *Data-parallel paradigms* (GPUs and vectorization)

Generators lend themselves naturally to exploiting event-level parallelism via **data-parallel paradigms**\*\*

- **SPMD**: Single Program Multiple Data (GPU accelerators)
- SIMD: Single Instruction Multiple Data (CPU vectorization: AVX...)
- <u>The computationally intensive</u> part, the matrix element  $f(\vec{x}_i)$ , is <u>the same function for all events i</u> (in a given category of events)
- Unlike detector simulation (where if/then branches are frequent and lead to thread divergence on GPUs)

Potential interest of GPUs

- Faster (cheaper?) than on CPUs
- Exploit GPU-based HPCs WIP for MG5aMC on GPUs



\*Note for software engineers: these calculations do involve some linear algebra, but "matrix element" does not refer to that! Here we <u>compute one "matrix element" in the S-matrix (scattering matrix)</u> for the transition from the initial state to the final state

\*\*This simple event-level parallelism can also be used as the basis for task-parallel approaches (multi-threading or multi-processing)



A. Valassi – MC generators challenges and strategy towards HL-LHC

LHCC – 01 Sep 2020

9



A. Valassi – Madgraph on GPUs and vector CPUs: towards production





### • The challenge: attracting, training, retaining people with the right competencies and interests

- -Can we attract and motivate young theorists to work on software and computing optimizations?
  - A theorist colleague I was recently talking to: "We had an opening for working on software optimizations for our Monte Carlo generator. The only suitable candidates were two theorists. But they were concerned that working on software optimizations would harm their future careers as theorists and refused the job. In the end, we did not hire anyone."
- -Can we attract and motivate *young software engineers* to work with us instead of tech or finance companies?

### I am only reporting a problem here... I do not have a magic-wand solution 8

### • A big lesson learnt from porting MG5AMC to GPUs: you need collaborations with a mix of skills!



• Developing Monte Carlo generator software: which kind of job is this? In which box should it be?

Scientific Computing and Software Collaborations







## Helicity amplitudes – same code in CUDA and in vectorized C++



- Old slide! The new code is different, the idea is the same!
- Formally the same code for CUDA and scalar/vector C++

hide type behind a typedefadd a few missing operators

SIMD in CUDA/C++ uses compiler vector extensions!

Flexible design: being reused also in
the vectorized SYCL implementation
typedef sycl::vec<fptype, MGONGPU\_MARRAY\_DIM> fptype\_sv;



CHEP2024, Krakow, 23 Oct 2024



## C++ vectorization in CUDACPP: overview

- Implementation is based on compiler vector extensions (CVEs): explicit vectors of floating point types
  - Supported by all of the gcc, clang and (through clang) Intel icpx compilers
  - Powerful but easy to use (no debugging auto-vectorization!), intuitive (they force you to design code for vector types!)



• Routinely build and compare five vectorization levels on Intel CPUs (and similar features on AMD or ARM CPUs)





## MG5aMC data parallelism: design for lockstep processing!

- In MC generators, the same function is used to compute the Matrix Element for many different events
  - -<u>ANY</u> matrix element generator is a good fit for lockstep processing on GPUs (SIMT) and vector CPUs (SIMD)
  - -Data parallelism strategy in madgraph4gpu is event-level parallelism (many events = many phase space points)



5



## Lockstep? MC generators (*lucky!*) vs MC detector simulation (unlucky)

- Monte Carlo methods are based on drawing (pseudo-)random numbers: a dice throw
- From a software workflow point of view, these are used in two rather different cases:



41/30

LA ROULETTE DE MONTE-CARLO Règle du Jei

## ME throughput in C++ for $gg \rightarrow t\bar{t}gg$ (on all the cores of a CPU)

ggttgg check.exe scalability on "bmk6130" (2x 16-core 2.1GHz Xeon Gold 6130 with 2x HT) for 10 cycles



- Large SIMD speedups are also confirmed when all CPU cores are used
  - AVX512/zmm speedup of x16 over no-SIMD for a single core slightly decreases to ~x12 on a full node (clock slowdown?)
  - Overall speedup on 32 physical cores (over no-SIMD on 1 core) is around 280 (maximum would be 16x32=512)
  - Aggregate MEs throughput from many identical processes using the standalone application
    - (HEP-workload Docker container from the HEPIX Benchmarking WG)

(5

(This addresses the question by Liz earlier this afternoon)

### Our internal Fortran-to-C++ interface: multi-event and stateless!

```
С
```

C Execute the matrix-element calculation "sequence" via a Bridge on GPU/CUDA or CUDA/C++.

- C PBRIDGE: the memory address of the C++ Bridge
- C MOMENTA: the input 4-momenta Fortran array
- C GS: the input Gs (running QCD coupling constant alphas) Fortran array
- C RNDHEL: the input random number Fortran array for helicity selection
- C RNDCOL: the input random number Fortran array for color selection
- C CHANID: the input Feynman diagram to enhance in multi-channel mode if 1 to n (disable multi-channel if 0)
- C MES: the output matrix element Fortran array
- C SELHEL: the output selected helicity Fortran array
- C SELCOL: the output selected color Fortran array

```
С
```

### INTERFACE

SUBROUTINE FBRIDGESEQUENCE(PBRIDGE, MOMENTA, GS,

& RNDHEL, RNDCOL, CHANID, MES, SELHEL, SELCOL)
INTEGER\*8 PBRIDGE
DOUBLE PRECISION MOMENTA(\*)
DOUBLE PRECISION GS(\*)
DOUBLE PRECISION RNDHEL(\*)

```
DOUBLE PRECISION RNDCOL(*)
```

INTEGER\*4 CHANID

```
DOUBLE PRECISION MES(*)
```

INTEGER\*4 SELHEL(\*)

INTEGER\*4 SELCOL(\*)

END SUBROUTINE FBRIDGESEQUENCE

END INTERFACE

This outputs the squared sum of amplitudes (real number)

As discussed with Simon, for HERWIG and other generators it may be useful to also expose an API that gives the partial amplitude (complex number) for a given colour structure



## Reweighting and weight derivatives in parameter estimation

• Weight derivative: event-by-event sensitivity to the measured parameter



First: makes it possible to determine the limit error with an ideal detector, and how much (0 to 1) we do worse

 with a given luminosity at a FCC-ee, what is the best theoretically achievable measurement on Higgs couplings?

## Knowing one's limits: maximum achievable information with an ideal detector

- Ideal acceptance, select all signal events S<sub>sel</sub>=S<sub>tot</sub>

- Ideal resolution, measured  $\gamma_i$  is that from MC truth (implies ideal rejection of background events,  $\gamma_i=0$ )

$$\mathcal{I}_{\theta}^{(\text{ideal})} \!=\! \sum_{i=1}^{N_{\text{tot}}} \gamma_i^2 \!=\! \sum_{i=1}^{S_{\text{tot}}} \gamma_i^2$$



• Second: can be used as a basis for an "improved optimal observable" ML method



https://doi.org/10.1051/epjconf/202024506038 https://zenodo.org/record/3715951



CHEP2024, Krakow, 23 Oct 2024





## Memory layouts – AOS, SOA, AOSOA

Matrix element calculation (simplified example)

- inputs[4\*Npar\*Nevt] = (x,y,z,E)-momentum of Npar particles for Nevt events (n-dim array, substructure)
 - outputs[Nevt] = matrix element for Nevt events (1-dim array, no substructure)

Example: Npar=6 particles for the 2 $\rightarrow$ 4 process gg $\rightarrow$ ttrgg

We have experimented with three possible memory layouts for momenta (1) Array-of-Structures **AOS**: momenta[Nevt][Npar][4]

- (2) Structure-of-Arrays SOA: momenta[Npar][4][Nevt]
   (3) AOSOA: momenta[Npag][Npar][4][Nepp] with Nevt = Npag ("pages") \* Nepp ("events per page")
   We are using AOSOA's as the current default but this is still largely configurable
- For CPU vectorization, AOSOAs (or SOAs) are absolutely mandatory!

-We use an AOSOA with Nepp equal to the SIMD vector size NeppV – and an aligned *malloc* is needed too! -For performance comparison we also build a no-SIMD mode with Nepp=1, which is effectively an AOS

- For GPUs (1 event per thread), AOSOAs are faster (fewer memory accesses) but not strictly necessary

   We use Nepp=4(8) for doubles(floats) so that each page is 32 bytes (the "sector" size, or L2 cache line size)
   For a given number of "requests", AOS uses 4 times more "sectors" (transactions) than AOSOA with Nepp=4
- Coding for SIMD is more complex than coding for GPUs...

## Monitoring GPU memory access – NSight Compute

- Explicitly collect two relevant profiler metrics in NSight Compute
  - -"requests" : I1tex\_t\_requests\_pipe\_lsu\_mem\_global\_op\_ld.sum
  - -"sectors" (i.e. transactions, network roundtrips): *l1tex\_t\_sectors\_pipe\_lsu\_mem\_global\_op\_ld.sum*
  - -this is from old tests in August 2020 (issue #16), the profiler metrics names may have changed since then

| 🚯 eemumuAV_cu_0814_1726_b16384_t32_j12_BASELINE.ncu-rep × 🚯 eemumuAV_cu_0814_1725_b16384_t32_j12_SOA.ncu-rep × 🚯 eemumuAV_cu_0814_1721_b163 | 32_112_AOS.nau-rep ×                                         |
|---------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------|
| Page: Details 👻 Launch: 3 - 502 - sigmaKin 💌 🍸 👻 Add Baseline 👻 Apply Rules                                                                 | Copy as Image                                                |
| Current 502 - sigmaKin (16384, 1, 1)x(32, 1, 1) Time: 632.13 usecond Cycles: 776,713 Regs: 152 GPU: Tesla V100-PCIE-32GB SM Frequency:      | cycle/nsecond CC: 7.0 Process: [22259] gcheck.exe            |
| BASELINE ASA 502 - sigmaKin (16384, 1, 1)x(32, 1, 1) Time: 584.90 usecond Cycles: 716,813 Regs: 152 GPU: Tesla V100-PCIE-32GB SM Frequency: | cycle/nsecond CC: 7.0 Process: [22731] gcheck.exe            |
| ▼ Command line profiler metrics                                                                                                             | o 1                                                          |
| litex_t_requests_pipe_lsu_mem_global_op_ld.sum [request] 1,527,808 (+0.00%) litex_t_sec                                                     | pipe_lsu_mem_global_op_ld.sum [sector] 39,753,533 (+290.86%) |

- Profile AOS against the AOSOA baseline
  - -same number of "requests" in AOS and AOSOA
  - -AOS needs 4 times as many "sectors" as AOSOA (which fits 4 doubles in a 32-byte cache line)
  - -in other words: AOSOA provides coalesced memory access, AOS does not
  - -for what it is worth (not much!), the actual slowdown in this  $e^+e^- \rightarrow \mu^+\mu^-$  example was only 7% however

## Inside the ME calculation: Feynman diagrams, colors, helicities

$$|\mathcal{M}|^{2}(\vec{p}) = \sum_{\lambda \in \{\text{hel}\}} \left[ \sum_{c \in \{\text{col}\}} \left| \sum_{d \in \{\text{diag}\}} (\mathcal{M}^{d}_{\lambda}(\vec{p}))^{(c)} \right|^{2} \right]$$

Given the momenta  $\vec{p}$  of initial+final partons in one specific event Sum over all helicity combinations  $\lambda$  of initial+final partons Sum over all color combinations c of initial+final partons Include all Feynman diagrams d allowed for the given  $\lambda$  and c

### In practice in MG5aMC: use helicity amplitudes and QCD color decomposition

1. (for each helicity  $\lambda$ ) compute partial amplitudes J<sup>f</sup> for each color ordering permutation f (sum diagrams relevant to f)

$$(J_{\lambda}(\vec{p}))^f = \sum_{d \in \{\text{diag}\}} (\mathcal{M}^d_{\lambda}(\vec{p}))^f$$

Example for  $gg \rightarrow t\bar{t}ggg$ : 1240 Feynman diagrams (using helicity amplitudes) This takes ~40% of the CPU time for this process

2. (for each helicity  $\lambda$ ) compute the sum over colors as the quadratic form JCJ\* using the constant color matrix C

$$|\mathcal{M}|^{2}(\vec{p}) = \sum_{\lambda \in \{\text{hel}\}} \left[ \sum_{f,g} (J_{\lambda}(\vec{p}))^{f} (C)^{fg} (J_{\lambda}^{*}(\vec{p}))^{g} \right]$$

Example for  $gg \rightarrow t\bar{t}ggg$ : 120 color ordering permutations, 120x120 matrix This takes ~60% of the CPU time for this process

3. Sum over helicities [Example for  $gg \rightarrow t\bar{t}ggg$ : 128 helicities (before and after filtering)]

### Each step computes many events $\vec{p}$ in parallel! CPU: 1 SIMD event-vector at a time. GPU: 1 event per thread.



### C++ vectorization – why choose Compiler Vector Extensions?

typedef fptype fptype\_v \_\_attribute\_\_ ((vector\_size (neppV\*sizeof(fptype))));

- Portable available in gcc, clang, icpx (from clang) with minimal differences

   Do not require any external libraries or tools (VC, VCL, VecCore, xSIMD, UME::SIMD, or SYCL...)
- Powerful, but easy to use
  - No need to debug auto-vectorization when it does not vectorize
  - -As powerful as intrinsics, but much easier to write (higher-level abstractions)
- Intuitive CVEs force you to think in terms of vector types!
- Minor disadvantage no vector complex type out of the box

   But it was easy to write it in our case (RRRRIIII memory layout) as we only need + × ÷
   A few extensions for Boolean vector masks were needed, too
- One technical detail: we malloc a standard (aligned!) fptype\* and reinterpret\_cast as fptype\_v\*...

### HUGE THANKS TO SEBASTIEN PONCE for his Practical Vectorization lectures mentioning CVEs!





## Monitoring lockstep – GPU NSight compute, CPU disassemble

- GPU: explicitly collect one profiler metric in NSight Compute
  - "branch efficiency" : sm\_\_sass\_average\_branch\_targets\_threads\_uniform.pct
  - -old test (May 2021 issue #25) comparing two code bases: no-divergence baseline has 100% efficiency, alternative with minor forced divergence has 96% efficiency (and is 20% slower)

| 🕹 eemumuAV_cu_05                | 513_1108_b2048_t256_i1_prof2default.ncu-re | p 🗙 🛍 eemumuAV_cu_0513_1107_b2048_t256_i1_prof2   | divergent.ncu-rep $	imes$   |                                                                             |           |                 |
|---------------------------------|--------------------------------------------|---------------------------------------------------|-----------------------------|-----------------------------------------------------------------------------|-----------|-----------------|
| <u>P</u> age: Details ¬         | Launch: 4 - 519 - sigmaKin                 | ▼ ▼ ▼ Add Baseline ▼ Apply <u>R</u> ules          |                             |                                                                             | α         | opy as Image 💌  |
| Current                         | 519 - sigmaKin (2048, 1, 1)x(256, 1, 1     | 1) Time: 476.93 usecond Cycles: 592,229 Regs: 128 | GPU: NVIDIA Tesla V100S-PCI | E-32GB SM Frequency: 1.24 cycle/nsecond CC: 7.0 Process: [12414] gcheck.e   |           | $\oplus \Theta$ |
| NO_DIVERGEN                     | ICE 519 - sigmaKin (2048, 1, 1)x(256, 1, 2 | 1) Time: 373.63 usecond Cycles: 467,720 Regs: 120 | GPU: NVIDIA Tesla V100S-PCI | E-32GB SM Frequency: 1.25 cycle/nsecond CC: 7.0 Process: [12636] gcheck exe |           |                 |
|                                 |                                            |                                                   |                             |                                                                             |           |                 |
| <ul> <li>Command lin</li> </ul> | ie profiler metrics                        |                                                   |                             |                                                                             |           | Q               |
| litex_t_reques                  | sts_pipe_lsu_mem_global_op_ld.sum [rec     | uest]                                             | 917,504 (+40.00%)           | litex t sectors pipe isu mem global op id.sum [sector]                      | 7.339.411 | (+40.00%)       |
| launch_registe                  | ers_per_thread [register/thread]           |                                                   | 128 (+6.67%)                | <pre>smsass_average_branch_targets_threads_uniform.pct [thread]</pre>       | 96.33     | (-3.67%)        |

- CPU: the best lockstep metric IMO is the speedup over a no-SIMD case (reach theoretical maximum!)
  - -but is also useful to disassemble the object using objdump and categorize SIMD intrinsics symbols...

|            | # Symbols in .o | SSE4.2 | AVX2<br>(ymm) | AVX512<br>(ymm) | AVX512<br>(zmm) |
|------------|-----------------|--------|---------------|-----------------|-----------------|
| <b>b</b> 0 | Build type      | (xmm)  |               |                 |                 |
| ≯tīgg      | Scalar          | 4534   | 0             | 0               | 0               |
| <u>88</u>  | SSE4.2          | 12916  | 0             | 0               | 0               |
|            | AVX2            | 0      | 10630         | 0               | 0               |
| 4a90ec2    | 256-bit AVX512  | 0      | 10366         | 12              | 0               |
| 49         | 512-bit AVX512  | 0      | 1267          | 60              | 9910            |

|                             | A  | CAT2022   | made                         | vent   |
|-----------------------------|----|-----------|------------------------------|--------|
| $gg \rightarrow t\bar{t}gg$ |    | MEs       | $N_{\rm events}/t_{\rm MEs}$ |        |
|                             |    | precision | [MEs                         | s/sec] |
| Fortran(scalar)             |    | double    | 2.30E3                       | (=1.0) |
| C++/none(scalar)            |    | double    | 2.28E3                       | (x1.0) |
| C++/sse4(128-bit            | )  | double    | 4.62E3                       | (x2.0) |
| C++/avx2(256-bit            | t) | double    | 1.05E4                       | (x4.6) |
| C++/512y(256-bi             | t) | double    | 1.16E4                       | (x5.0) |
| C++/512z(512-bit            | t) | double    | 1.91E4                       | (x8.3) |



A. Valassi – Madgraph on GPUs and vector CPUs: towards production

## Code generation: from many "epochs" to a single evolving "epoch" ... and beyond



A. Valassi - Madgraph on GPUs and vector CPUs: towards production

.....5

CHEP2024, Krakow, 23 Oct 2024

## What about loops? And how many are N events?

• You will still need to loop over multiple sets of N events

-And the internal implementation of N-event processing may still involve some loops!



NB: I focus on event-level parallelism, but other options exist
 In MG5AMC we will investigate using 1 GPU thread per helicity per event...

A. Valassi – Madgraph on GPUs and vector CPUs: towards production

**5**...,

CHEP2024, Krakow, 23 Oct 2024

### Why focus on complex processes? Compute >> memory!

### CUDA: Host(CPU)-to/from-Device(GPU) data copy has a cost

- In our standalone application (all on GPU): momenta, weights, MEs D-to-H - Plots below from Nvidia Nsight Systems: 12 iterations with 524k events in each iteration
- Eventually, MadEvent on CPU + MEs on GPU: momenta H-to-D; MEs D-to-H
- The time cost of data transfers is relatively high in simple processes

- ME calculation on GPU is fast (e.g.  $e^+e^- \rightarrow \mu^+\mu^-$ : 0.4ms ME calculation ~ 0.4ms ME copv) • Note: our ME throughput numbers are ( number of MEs ) / ( time for ME calculation + ME copy )



- We are lucky: the more complex the physics process, the less relevant is the cost of **GPU-CPU** data copies!
  - -Similarly (later): the more complex the process, the less relevant is the overhead from scalar Fortran in madevent!
  - And the fewer events in flight needed to fill the GPU...
- In this talk I mainly give performance numbers for complex processes like  $gg \rightarrow t\bar{t}gg$  Or  $gg \rightarrow t\bar{t}ggg$

## Some ideas for heterogeneous processing



Throughput variation as a function of GPU grid size (#blocks \* #threads)

This is the number of events processed in parallel in one cycle

### To further reduce the relative overhead of the scalar Fortran MadEvent - parallelize it on many CPU cores?

- Blue curve: one single CPU process using the GPU
  - For  $gg \rightarrow t\bar{t}gg$ , you need at least ~16k events to reach the throughput plateau
- Yellow, Green, Red curves: 2, 4, 8 CPU processes using the GPU at the same time
  - Fewer events in each GPU grid are needed to reach the plateau if several CPU processes use the GPU
  - The total Fortran RAM would remain the same, but the CPU time in the Fortran overhead would be reduced
  - (Why total throughput increases beyond the nCPU=1 plateau is not understood yet!...)

5

CHEP2024, Krakow, 23 Oct 2024

### Lockstep beyond event-level parallelism

- Efficient data parallelism (lockstep processing) requires the same function computed for different data
  - -This is true in MG5AMC at the *event level* (different events i.e. different phase space points)
  - -But it is also true at the *sub-event level* (different helicities within the same event)
- We are evaluating the move to a different data parallelism strategy on GPUs
  - -Currently: one event (sum over all helicities) per GPU thread
  - In the future: one helicity of one event per GPU thread?

$$|\mathcal{M}|^{2}(\vec{p}) = \sum_{\lambda \in \{\text{hel}\}} \left[ \sum_{f,g} (J_{\lambda}(\vec{p}))^{f} (\mathcal{C})^{fg} (J_{\lambda}^{*}(\vec{p}))^{g} \right]$$

$$(J_{\lambda}(\vec{p}))^{f} = \sum_{d \in \{\text{diag}\}} (\mathcal{M}_{\lambda}^{d}(\vec{p}))^{f}$$

- Advantages:
  - -You can fill the GPU with much fewer "events in flight" more balanced sampling/integration in MadEvent
  - -This is a prerequisite for moving the color matrix to externally-launched cuBLAS and tensor cores
  - -This is also a prerequisite if we want to evaluate much smaller kernels
    - From all Feynman diagrams in one kernel to one Feynman diagram per kernel?
    - Which might decrease register pressure and increase kernel occupancy, but would require more global memory access

## **CUDACPP MEs**



- 95% common code + a few #ifdef's for CUDA vs C++
- Designed for NVidia GPUs (so far: will add HIP/AMD)
   Full feature support, e.g. tensor cores, streams, graphs



- Designed upfront for SIMD speedups on vector CPUs
   Intel®AVX512
- WIP on CPU multithreading and heterogeneous modes

### PFMEs alsaka Kokkos (SYCL.

- Write code once for many CPU/GPU vendors
- Support NVidia, AMD and Intel GPUs out-of-the-box – Limited support for vendor-specific features



• SIMD added via SYCL in Jan 2023, analysing results

• CPU multithreading out of the box

For the moment: we plan to continue development in parallel using both approaches – comparisons are very useful! Two goals: not only production releases, but also *aim to provide useful feedback to HEP about usability of PFs* 

CHEP2024, Krakow, 23 Oct 2024 55/30

### CUDACPP vs. Portability Frameworks - recap

- CUDAPP (our initial implementation) is where we add new features first
- The SYCL implementation of MG5aMC is now almost at the same level, the KOKKOS one somewhat behind
- The ALPAKA implementation of MG5aMC is no longer maintained

| Backend           | ME code generation | Standalone application | Actively maintained | MadEvent application | Latest dev code base |
|-------------------|--------------------|------------------------|---------------------|----------------------|----------------------|
| CUDACPP           | $\checkmark$       | $\checkmark$           | ✓                   | <b>~</b>             | $\checkmark$         |
| SYCL              | ✓                  | ✓                      | ✓                   | ✓                    | ~ ✓                  |
| KOKKOS            | $\checkmark$       | ✓                      | ~ ✓                 | WIP                  | WIP                  |
| ALPAKA<br>(CUPLA) | ✓                  | ✓                      | ×                   | ×                    | ×                    |



## CUDACPP vs PFs - GPU ME throughputs (standalone application)



- The performances of the SYCL and Kokkos implementations of MG5aMC seem comparable to direct CUDA Further comparisons are in progress, performance scales differently with more jets for different backends (next slide)
- SYCL and Kokkos run out of the box also on AMD and Intel GPUs
  - They also run out of the box on CPUs (performance under investigation)

Xe-HP is a software development vehicle for functional testing only - currently used at Argonne and other customer sites to prepare their code for future Intel data centre GPUs XE-HPC is an early implementation of the Aurora GPU

(5.



## CUDA vs SYCL on NVidia A100

PRELIMINARY! N. Nichols, T. Childers (SYCL) J. Teig (tests/plots)

- SYCL and CUDA implementations have ~similar performances but –SYCL seems better for less complex processes
  - CUDA seems better for more complex processes
- These are very recent results, which are still being digested (WIP!)

   It will be very interesting to understand more in detail what goes on

We plan to also compare more systematically the CUDACPP and SYCL performances on CPUs (vectorization, multi-core), but it will take time and optimization tweaks... WIP!



CHEP2024, Krakow, 23 Oct 2024 🏌

# MORE BACKUP SLIDES



CHEP2024, Krakow, 23 Oct 2024 59/30

### Code generation: how did we bootstrap the project?

### Code is auto-generated $\Rightarrow$ Iterative development process

- User chooses process, MG5aMC determines Feynman diagrams and generates code
  - Currently Fortran (default), C++, or Python
  - The more particles in the collision, the more Feynman diagrams and the more lines of code



| Process                         | LOC  | functions | function calls |  |
|---------------------------------|------|-----------|----------------|--|
| $e^+e^- \rightarrow \mu^+\mu^-$ | 776  | 8         | 16             |  |
| $gg \rightarrow t\bar{t}$       | 839  | 10        | 22             |  |
| $gg \rightarrow t\bar{t}g$      | 1082 | 36        | 106            |  |
| $gg \rightarrow t\bar{t}gg$     | 1985 | 222       | 786            |  |





A. Valassi – Madgraph on GPUs and vector CPUs: towards production

5

CHEP2024, Krakow, 23 Oct 2024

MADGRAPH

FIRST

C++ COD



## Filling the GPU – minimum number of threads (events in flight)



- We are lucky, again: the more complex the process, the fewer the events in flight needed to fill the GPU
- But even 16k events is a lot: it results in imbalanced phase space sampling, and high RAM in Fortran
  - -Eventually, maybe: one helicity per kernel (fewer events in flight, spread each event across many kernels)?
  - -Eventually, maybe: many CPU cores/processes in parallel (fewer events in flight per CPU core/process)?
  - -Eventually, maybe: different channels in parallel (fewer events in flight in a single channel)?

5

### THE CADNA LIBRARY

- Computers sometimes lie about floatingpoint numbers
- CADNA is a library with special floatingpoint types to measure precision and instabilities in C++ and Fortran
- Each number knows its current precision
- CADNA counts unstable operations

### • See <u>seminar at CERN</u>

STEPHAN HAGEBOECK - MG5 AMC MEETING 202

5

https://indico.cern.ch/event/1264290/

S. Hageboeck, Gargnano meeting 18 Sep 2023

▶ P(x,y)= 9x<sup>4</sup> - y<sup>4</sup> + 2y<sup>2</sup>

### Without CADNA:

#### With CADNA:

P(10864,18817) = @.0 (exact value: 1) P(1/3,2/3) = 0.802469135802469E+000

0 UNSTABLE DIVISION(S) 0 UNSTABLE POWER FUNCTION(S) 0 UNSTABLE MULTIPLICATION(S) 0 UNSTABLE BRANCHING(S) 0 UNSTABLE INTRINSIC FUNCTION(S) 2 UNSTABLE CANCELLATION(S)

## Numerical precision: CADNA (can we use floats instead of doubles?)

### F. Optolowicz, CERN EP-SFT meeting 21 Aug 2023



• Application to MG5AMC CUDACPP:

- assess precision of the ME calculation (when using floats: down to 3 significant digits in gg to ttggg)
- understand where in the code the precision is lost (typically, cancellations subtracting large terms, one example being heavily suppressed helicities)

CHEP2024, Krakow, 23 Oct 2024

### All MadEvent functionalities have been integrated over time

Most of these required some changes to the input/output API of our Fortran-to-CUDA/C++ "Bridge"

- *Helicity filtering* at initialization time, compute the allowed combinations of particle helicities -This is computed in CUDA/C++ using the same criteria as in Fortran
- *"Multi-channel"* single-diagram enhancement of ME output
  - -This is the specificity of the MadEvent sampling algorithm (<u>Maltoni Stelzer 2003</u>)  $f_i = \frac{|A_i|^2}{\sum |A_i|^2} |A_{tot}|^2$
- Event-by-event running QCD coupling constants  $\alpha_s(Q^2)$ 
  - -The scale is currently computed in Fortran from momenta and passed to the CUDA/C++ for each event
- Event-by-event *choice of helicity and color* in LHE files
  - -Pass two additional random numbers per event from Fortran to CUDA/C++, retrieve helicity and color
  - -NEW (January 2023)! This was the last big missing physics functionality (showstopper to a release)
    - We now get the same cross section AND the same LHE files (within numerical precision) in Fortran and CUDA/C++

### Benchmarking – Madgraph and the HEP-SCORE project

- HEPscore: the new HEP benchmark for compute resources, replacing HepSpec06
  - -Based on reproducible HEP workloads (GEN, SIM, DIGI, REC...) within docker containers
  - -The first version HEPscore23 should become production in April 2023 for (x86 and ARM) CPUs
- The aim is to *benchmark a fully loaded server:* all CPU cores, and eventually all associated GPUs
  - -(and ideally measure how well an application is doing compared to the theoretical power of the server...)
  - -fill all CPU cores by a combination of application multi-threading and/or several identical copies/processes
- A first container based on our Madgraph-on-GPU has been prepared
  - -Very useful because it gives the same physics results on CPU and GPU: may compare them to each other!
  - -And eventually may be used to evaluate heterogeneous processing on CPU+GPU...
- The plots on the next slides are based on this HEPscore container: several identical copies/processes

   (A multi-threaded CUDACPP version exists but not optimized yet SYCL and Kokkos also provide MT)

**~**5...,

CHEP2024, Krakow, 23 Oct 2024

## MG5AMC is not alone – SHERPA on GPU (BlockGen)



Figure 7: The timings for various GPU-based algorithms are compared as a function of gluon multiplicity. All algorithms were run on an NVIDIA V100 (16 GB global memory, 5,120 CUDA cores, 6144 KB L2 cache).

From http://dx.doi.org/10.21468/SciPostPhysCodeb.3

More recent results were presented in June 2023 in Les Houches by Max Knobbe

- Note: unlike MG5aMC, based on Feynman diagrams, SHERPA uses ~Berends-Giele recursion relations
   Allows computations with more final-state jets
- No ongoing effort on CPU vectorization (yet)
- Planned Les Houches project: a detailed comparison of *software performances* of MG5AMC and SHERPA
  - -Tentative process list: pp to tt(0-3jets) or Z(0-3jets)
  - -Previously, an old wish of the HSF generator WG
  - -(NB: not a comparison of physics results or distributions)



# EVEN MORE BACKUP SLIDES

A. Valassi – Madgraph on GPUs and vector CPUs: towards production



## Multi-channel



What do we do if there is no transformation that aligns all integrand peaks to the chosen axes? Vegas is bound to fail!

Solution: use different transformations = channels

$$p(x) = \sum_{i=1}^n \alpha_i p_i(x) \qquad \text{with} \qquad \sum_{i=1}^n \alpha_i = 1$$

with each  $p_i(x)$  taking care of one "peak" at the time

$$I = \int dx \ f(x) = \sum_{i=1}^{m} \alpha_i \int \frac{f(x)}{p(x)} dP_i(x),$$

Mílan 2020

OM, mgongpu dev meeting 22 Jun 2020

10

### Mattelaer Olívíer

5



CHEP2024, Krakow, 23 Oct 2024



## Additional information tracking

- Leading-Color information
  - Needed for starting point of the shower
  - → Can be memory heavy
  - Really needed
  - Not easy to include as post-processing
- Helicity information
  - → Not really needed (not provided at NLO)
    - CMS request it (use post-processing for NLO)
  - Easy for post-processing

OM, mgongpu dev meeting 22 Jun 2020

21

Mattelaer Olívíer

Mílan 2020



CHEP2024, Krakow, 23 Oct 2024



## 3. Github Actions

- Actions file in .github/workflows/c-cpp.yml Ο
- If you have C++-only job, can Ο hook yourself into that
- WIP: CUDA job on CERN Ο premises
  - Already ran CUDA tests in Ο container
- Development done, waiting for Ο review: PR#52
- Will apply same strategy to Ο epoch2/cuda

| name: C/C++ CI                               |                            |   |
|----------------------------------------------|----------------------------|---|
| on:                                          |                            |   |
| pull_request:                                |                            |   |
| branches: [ master ]                         |                            |   |
| jobs:                                        |                            |   |
| epoch1_eemumu:                               |                            |   |
| runs-on: ubuntu-latest                       |                            |   |
| defaults:                                    |                            |   |
| run:                                         |                            |   |
| working-directory:                           |                            |   |
| epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sn | 1_epem_mupmum              |   |
| steps:                                       |                            |   |
| - uses: actions/checkout@v2                  |                            |   |
| - name: make gtest                           |                            |   |
| working-directory: tools                     |                            |   |
| run: make                                    |                            |   |
| - name: make                                 |                            |   |
| run: make                                    |                            |   |
| - name: make check                           | SH, mgongpu<br>dev meeting |   |
| run: make check                              | 30 Nov 2020                | 8 |



CHEP2024, Krakow, 23 Oct 2024



### Simultaneous development for CUDA and C++ Aug 2020

### Design principle: same source code for CUDA and C++

- –Why? Large overlap from the start (CUDA derived from standalone C++)
- -Why? Avoid divergence and the need to duplicate effort
- –Why? Because some optimizations (e.g. AOSOA) may be good for both!
- -How? Use #ifdef's more than 90% of the code remains common

### Design principle: decouple floating-point data types from calculations

- –Why? Allow switching from double to float (in both CUDA and C++)
- -How? Typedef, default being "typedef double fptype;"

### Design principle: decouple data access from calculations

-Why? Allow using formally the same code in CUDA and C++ • Note: C++ is not vectorized yet here, data types are the same (scalar) in both -Why? In the past, also allowed SOA/AOS and global/shared/local comparison -How? Use references and constant references in compute engines -How? Move data access to separate routines, build references from them

const fptype& plparlp4levt( const fptype\* momenta1d, const int ipar, const int ip4, const int ievt )



A. Valassi - Madgraph vectorisation

MG4GPU development meeting – 14 Dec 2020

OLD

3

(master)



A. Valassi – Madgraph on GPUs and vector CPUs: towards production

CHEP2024, Krakow, 23 Oct 2024



### **AOSOA** memory layouts for CUDA and C++



### Design principle: move from variable-size to fixed-size arrays

- -Why? The cardinalities of all relevant data sets (#particles, #momenta...) are fixed!
- -How: replace all std::vector<std::vector<xxx>> by C-style arrays xxx[][]
  - This alone gave a nice performance boost to both CUDA and C++ (IIRC...)

### Vague design intuition: achieve SOA or AOSOA memory layouts

- -Why? It may help for coalesced memory access in CUDA (eventually it did)
- -Why? It may help for SIMD vectorization in C++ (eventually it did)
- -How? C-style arrays: e.g. SOA, "double momenta[4=xyzE][#particles][#events]"
  - Note: this can also be cast from/to "double momenta[]" i.e. "double\* momenta"
- -How? Many iterations, each with nevt events (=ndim=#threads\*#blocks on GPU)
  - All arrays are dimensioned to handle the nevt events in one iteration
- -How? AOSOA eventually favored over SOA, split nevt into "pages"
  - Hence nevt=npag\*nepp (#pages \* #events per page), yielding AOSOA[npag][...][nepp]
  - More manageable than SOA[...][npag\*nepp], better performance in CUDA
  - Note: can always recover AOS[nevt][...][1] by setting npag=nevt and nepp=1



CHEP2024, Krakow, 23 Oct 2024



### SIMD vectorization (1) – event loops



### • Design principle: make the event loop the innermost loop

-Why? Event-level data parallelism: the same calculations apply to all events

- -How? Some refactoring, e.g. move helicity loops outside the event loop
  - Note: involves rethinking the interfaces of C++ methods ~analogous to CUDA kernels – Hence the name of the branch "klas": kernel launchers and SIMD vectorization
- -Note: our CUDA implementation is also based on event-level data parallelism
  - But in CUDA we *already* achieved event-level parallelism even without these changes
    GPU SIMT/SPMD parallelism is simpler to achieve than CPU SIMD vectorization





MG4GPU development meeting – 14 Dec 2020



CHEP2024, Krakow, 23 Oct 2024

5



#### SIMD vectorization (2) – interfaces and types



#### Implementation choice: compiler vector extensions (CVE)

- -Why? No external dependency (VC, VecCore...), simpler than intrinsics
  - More predictable than auto-vectorization alone, but still very lightweight
- –Why? Available for gcc, but also (with some limitations) for clang/OpenCL and icc
  - My initial tests: may need intrinsics if we want to support clang, operator[] is not a ref
- –How? Floating point vector  $\underline{fptype v}$  (RRRR) is a CVE typedef of size neppV
  - "typedef fptype\_v \_\_attribute\_\_ ((vector\_size (neppV\*sizeof(fptype))));" // RRRR
- -How? Complex vector <u>cxtype\_v</u> (RRRIIII) is a class with two fptype\_v vectors "class cxtype\_v{ ...; private: fptype\_v m\_real, m\_imag; }; // RRRRIII
  - Add boilerplate implementation for missing "cxtype[v] operator cxtype\_v|fptype[v]"
  - No ambition to be complete: only add what is needed (as we do for cxtype's in CUDA)
- -How? Refactor interfaces of C++ compute engines using vector types
  - The nevt=npagV\*neppV events are split in npagV pages with neppV events per page
  - Explicit loop on npagV in the calling function, implicit loop on neppV in CVE vectors
  - This is where most of the actual coding work was necessary



A. Valassi – Madgraph vectorisation

MG4GPU development meeting – 14 Dec 2020 6





#### SIMD vectorization (3) – formal language



- Design principle: decouple data access and data types from calculations
  - -Why? Allow using formally the same code in CUDA and scalar or vectorized C++
  - -How? Add another typedef which can be scalar or vector depending on context





A. Valassi – Madgraph vectorisation

MG4GPU development meeting – 14 Dec 2020





#### **SIMD** vectorization (4) – architectures



Tested various combinations of compiler flags and architectures

 AVX2 gives almost a factor 4 speedup from SIMD vectorization!

Close to maximum theoretical limit (vectors of 4 doubles)

-AVX512 is not faster (or is even a bit slower) than AVX2

• Not too surprising, there are known issues - may investigate more later if needed

#### Support choice: focus on AVX2, support scalars, AVX512 and SSE

-Reminder: scalar=1, SSE=2, AVX2=4, AVX512=8 doubles per vector

-Why? AVX2 gets the highest gain (4x), keep AVX512 anyway for further tests

- -Why? Most modern CPUs have AVX2, but some Grid nodes only have SSE
  - I had initially discarded SSE, will add support for it as agreed

-How? Hardcode vector sizes depending on #ifdef"s for AVX2 and AVX512F



A. Valassi – Madgraph vectorisation

MG4GPU development meeting – 14 Dec 2020 8





## Tests on Cori – First success on AVX512/zmm

- See <a href="https://github.com/madgraph5/madgraph4gpu/pull/236">https://github.com/madgraph5/madgraph4gpu/pull/236</a>
  - Results from Friday evening not yet merged
- CORI at NERSC Intel Xeon Gold 6148 CPUs plus V100 GPUs
  - Temporary access during a training about CUDA Streams (thanks ATLAS!)

#### First demonstrated benefit of AVX512/zmm over AVX512/ymm

- For doubles: around 10%
- For floats: around 30%
- To be compared to a theoretical factor 2 maximum speedup
- Things I have not tested
  - perf was not available
  - I forgot to compare with/without LTO/inlining (results above are no LTO, no inlining)



AV – WIP on Madgraph4GPU

20 Jul 2021





## **Towards production: two development lines?**

- Fastest: "standalone" C++ driver
- Allows fastest prototyping – minimal overhead, ~pure ME
- Complete minimal eemumu anyway

   unweighting, cross section
- Freedom to redesign from scratch

   new sampling e.g. VegasFlow
   GPU specific e.g. PDFFlow

- For production: madevent Fortran driver
- Keep the framework used by users

   and all complexities (NLO, pdf...)
- Two basic options
  - 1. Rewrite full madevent Fortran in C++, keep the Python machinery
    - MadEvent input/output?
    - Complex, slow, unnecessary?
  - 2. <u>Inject only C++/CUDA ME</u> <u>calculation into Fortran madevent</u>
    - Seems feasible and ~fast
    - Small apple-to-apple replacement (and timing comparison)



5

AV – WIP on Madgraph4GPU

#### 22 Mar 2021 12



# Fortran: integration proposal details

- Analysis (so far) in <a href="https://github.com/madgraph5/madgraph4gpu/issues/121">https://github.com/madgraph5/madgraph4gpu/issues/121</a>
- Two main problems:

5

- 1. Fortran matrix1 routine is not completely stateless: must clean up inputs/outputs
- 2. Fortran matrix1 operates on a single event: must rewrite the flow for many events
  - Absolutely needed both for GPUs and for SIMD on CPUs
- Can proceed in two steps
  - 1. Make Fortran stateless, inject single-event C (C++/CUDA), test cross-linkage with data
  - 2. Split the Fortran event loop that calls matrix1 (via dsig)

|                                              | 124        |                                                                                       |                |
|----------------------------------------------|------------|---------------------------------------------------------------------------------------|----------------|
|                                              | 125        | c Main Integration Loop                                                               |                |
|                                              | 120        | iter = 1                                                                              |                |
|                                              | 128        | do while(iter .le. itmax)                                                             |                |
|                                              | 129        | c                                                                                     |                |
| Split here:                                  | 130        | c Get integration point                                                               |                |
|                                              | 131        | c                                                                                     |                |
| - for many events, call x to f arg (random + | 132        | <pre>call sample_get_config(wgt,iter,ipole)</pre>                                     |                |
|                                              | 133        | if (iter .le. itmax) then                                                             |                |
| sampling), store the corresponding momenta   | 134        | <pre>ievent=ievent+1 call x_to_f_arg(ndim,ipole,mincfig,maxcfig,ninvar,wgt,x,p)</pre> |                |
| - loop over the stored momenta, call dsig in | 136        | if (pass_point(p)) then                                                               |                |
| - loop over the stored momenta, call usig in | 137        | <pre>fx = dsig(p,wgt,0) !Evaluate function</pre>                                      |                |
| parallel for many events (must remove state  | 138        | wgt = wgt*fx                                                                          |                |
| •                                            | 139        | <pre>if (wgt .ne. 0d0) call graph_point(p,wgt) !Update graphs</pre>                   |                |
| also from dsig, not only from matrix1)       |            | else                                                                                  |                |
| alloo noni doig, not only non mathematic     | 141        | fx =0d0                                                                               |                |
|                                              | 142        | wgt=0d0<br>endif                                                                      |                |
|                                              | 143        | endit<br>call sample_put_point(wgt,x(1),iter,ipole) !Store result                     |                |
| Allocate memory in advance, before that      |            | endif                                                                                 |                |
|                                              | 145<br>146 | if (wgt .ne. 0d0) kevent=kevent+1                                                     |                |
|                                              | 147        | c                                                                                     |                |
|                                              | 148        | c Write out progress/histograms                                                       |                |
|                                              | 149        | c                                                                                     |                |
|                                              | 150        | if (kevent .ge. nwrite) then                                                          |                |
| à                                            | 151        | <pre>nwrite = nwrite+ncall*itmax/nsteps nwrite = min(nwrite,ncall*itmax)</pre>        |                |
| ERN ALL MUD on Madaronh (CDL)                | 152        | <pre>call graph_store</pre>                                                           | 22 Mar 2024 4/ |
| AV – WIP on Madgraph4GPU                     | 154        | endif                                                                                 | 22 Mar 2021 14 |
|                                              | 155        | 99 enddo                                                                              |                |
|                                              |            |                                                                                       |                |

A. Valassi – Madgraph on GPUs and vector CPUs: towards production



#### Interfacing with Fortran (MadEvent): potential challenges

- Work just starting: inject our CUDA/C++ ME calculation into MadEvent
  - Fastest way to bring this R&D work to production discussed with experiments in HSF WG
  - More than just MadEvent and Fortran: bash, Python, C++ and pdf, PS, NLO merging...
  - Easier validation leverage on established infrastructure, no change in user interface
- From a first look at MadEvent: two potential challenges (legacy code reengineering)
  - (1) must create event baskets a posteriori (current code loops on individual events)
  - (2) Fortran common blocks complicate separation of inputs and outputs (not pure functions)





A. Valassi – Reengineering Madgraph5\_aMC@NLO for GPUs and vector CPUs

vCHEP – 19 May 2021 21



# Old design

#### dsample.f

call x\_to\_f\_arg(...) -> x, p, wgt (local) -> xbk, q2fact, cm\_rap (common)

- fx = dsig(p,wgt,0)
- -> lot is hidden here!
- -> cuts
- -> matrix-element choice
- -> symmetry handling
- -> scale choice

call sample\_put\_point()
-> add grid running information
-> decide to pass to next iteration
-> update the grid

#### auto\_dsig.f

- SELPROC(...)=DSIGPROC(...)
- -> Compute PDF only for all flavor
  - Pick a given matrix-element
  - Pick a symmetry channel
- -> run the cuts (inside dsigproc)

Reset clustering flag

DSIG=DSIGPROC()

- -> re-run cuts
  - -> update model
- -> compute matrix-element
- -> compute the pdf
- -> in principle cached

OM, mgongpu dev meeting

17

7 Jun 2021

Mattelaer Olívíer

HSF 2021





# New design



A. Valassi - Madgraph on GPUs and vector CPUs: towards production



# Edit metadata \_\_init\_\_.py

#### File Edit Options Buffers Tools Python Help import the required files example: import maddm\_interface as maddm\_interface # local file import madgraph.various.cluster as cluster #MG5 distribution file Three types of functionality are allowed in a plugin 1. new output mode 2. new cluster support 3. new interface 1. Define new output mode example: new\_output = {'myformat': MYCLASS} madgraph will then allow the command "output myformat PATH" MYCLASS should inherated of the class madgraph.iolibs.export\_v4.VirtualExporter Key: name of output mport output ew\_output = {'standalone\_cuda':output.CUDAExporter} value: class to use example new\_cluster = {'mycluster': MYCLUSTERCLASS} allow "set cluster\_type mycluster" in madgraph MYCLUSTERCLASS should inherated from madgraph.various.cluster.Cluster new\_cluster = {} 3. Define a new interface (allow to add/modify MG5 command) This can be activated via ./bin/mg5\_aMC --mode=PLUGINNAME # Put None if no dedicated command are required ew\_interface = None OM, mgongpu $_version_ = (1,0,0)$ $\min[mal_mg5amcnlo_version = (2,3,4)]$ codegen workshop maximal\_mg5amcnlo\_version = (1000,1000,1000) 16 Sep 2021 $atest_validated_version = (2,4,0)$

Mattelaer Olivier

**5**...

mG5



82/30

6

A. Valassi – Madgraph on GPUs and vector CPUs: towards production

# Stateful interface for Fortran bridge?

- See the discussion in <a href="https://github.com/madgraph5/madgraph4gpu/issues/329">https://github.com/madgraph5/madgraph4gpu/issues/329</a>
- First point (easy): need the same Bridge instance across several ME computations
  - compute good helicities only once
  - more generally, provide a single constructor/initialization (e.g. also for Kokkos etc)
- · Second point (we should agree): static Bridge singleton vs. allow several Bridges
  - OPTION 1 (a la SR): Bridge address invisible from Fortran (use static Bridge singleton)
    - No explicit Bridge constructor call from Fortran (construct C++ Bridge on first ME calculation)
    - Assumes Fortran will not use multi-threading (but C++ Bridge can internally be MT)
    - Fewer Fortran calls to C (but we must ensure Fortran/C interface match for other methods anyway)
    - No explicit destructor, relies on static going out of scope at the end of the program
  - OPTION 2 (a la AV): Bridge address visible from Fortran (several Bridges possible)
    - Explicit Bridge constructor and destructor from Fortran (memory address is Fortran INTEGER\*8)
    - Two more Fortran calls to C (two more methods needing checks of Fortran/C interface match)
  - What's your preference? (Olivier, especially what's <u>your</u> preference?)
  - In both cases: I suggest adding a Fortran INTERFACE block to expose the C interface



AV – Madgraph4GPU WIP (pptt, bridge, split kernels, compilers etc)

24 Jan 2022





### Code generation backport – *completed* for cudacpp!

- Track overall progress on <a href="https://github.com/madgraph5/madgraph4gpu/issues/244">https://github.com/madgraph5/madgraph4gpu/issues/244</a>
  - From epoch1, epoch2... to a single evolving epochX
  - Include code generation plugin, auto code, manual code, throughput logs
  - See <a href="https://github.com/madgraph5/madgraph4gpu/tree/master/epochX/cudacpp">https://github.com/madgraph5/madgraph4gpu/tree/master/epochX/cudacpp</a>
- Various steps for Cuda/C++ plugin NB: epochX is the current latest cuda/c++ code!
  - 1. Basic infrastructure [DONE] PR #245
  - 2. Backport "epoch2 ggttgg" [DONE] PR #247, PR #254, PR #256
    - Tagged as <u>"golden\_epochX2</u>" for Kokkos/Alpaka/Sycl comparison, Fortran/Cuda bridging...
  - 3. Backport "epoch1 eemumu" (vectorization) [new: DONE] PR #253
  - -4. Generate vectorized ggttgg in epochX [new: DONE] PR #267, PR #270
    - Tagged as <u>"golden\_epochX4</u>" for Kokkos/Alpaka/Sycl comparison, Fortran/Cuda bridging...
    - Some preliminary results on the next slide
  - 5+. Ongoing/planned: cleanup, develop, iterate... (but this is beyond the backport!)
    - · Cleaned up several old tickets, tried to document important issues
    - Ongoing work on tests for ggtt/ggttgg with StephanH
    - Planned: documentation for epochX (code generation, performance/functional tests)
    - Available to help on Alpaka/Kokkos/... backport if needed
    - A few other pending developments, some in epoch1 (e.g. neppM vs neppR)



AV – WIP on Madgraph4GPU

25 Oct 2021





## **Class architecture – 3+ types of classes**

- (1) Memory buffers
  - Allocate unstructured memory (e.g. double\*) for a given number of events
  - Different classes for host/pinnedhost/device memory and for momenta, MEs etc.
- (2) Memory access
  - Define the internal substructure of memory buffers (e.g. AOSOA for momenta)
  - Define how CUDA/C++ kernels locate event records and fields in memory buffers
  - Different classes for host and device memory access and for momenta, MEs etc.
- (3) Kernel launchers
  - Constructors from references to existing input and output memory buffers
  - Computational functions with C++ interface, encapsulating C++ or CUDA kernels
    - Internally use low level Rambo/HelAmps functions for specific memory access template parameter
  - Different classes for host and device kernel processing and for Rambo, ME calculation etc.
- Applications (check\_sa, runTest, Bridge) are built in terms of these 3 "LEGO bricks"
  - Eventually: sequence classes for some of these sequences?



AV – Madgraph4GPU PRs (class structure, standalone Bridge tests etc)

10 Jan 2022





## Recap of missing pieces (in Bridge and implementation)

- My current Bridge: input momenta  $\rightarrow$  output MEs, per event (arrays)
  - Overview (not updated) in https://github.com/madgraph5/madgraph4gpu/issues/404
- Multichannel: must add input diagram ID (scalar)
  - MadEvent single-diagram enhancement: output ME multiplied by amp(ID)/sum\_amp
    - Done by Olivier in standalone\_gpu, I must integrate it in madevent + cudacpp
  - NB can ignore "get\_channel\_cut' https://github.com/madgraph5/madgraph4gpu/issues/419
- Running of alphas: must add input alphas, per event (array)
  - https://github.com/madgraph5/madgraph4gpu/issues/373
  - The QCD running coupling constant alpha\_strong is different in each event
  - What we chose (for now?): Fortran chooses the scale and passes it to cudacpp
    - Done by Stefan in standalone\_cudacpp, I must integrate it in madevent + cudacpp
- Random color choice: must add input random and output color, per event (array)
  - https://github.com/madgraph5/madgraph4gpu/issues/402
  - To be implemented in cudacpp: random picking of color
- Random helicity choice: must add input random and output helicity, per event (array)
  - https://github.com/madgraph5/madgraph4gpu/issues/403
  - To be implemented in cudacpp: random picking of helicity
- For xsec: multichannel, alphas. Only for unweighted events: random color, helicity.



AV – Madgraph4GPU WIP (madevent + cudacpp integration)

04 Apr 2022

5





# Matrix element integration in MadEvent: results

- Functional results (Madevent with Fortran MEs vs CUDA/C++ MEs, using the same random seeds)
  - -Cross section calculation: done! (Same cross section within ~E-14 relative accuracy)
  - Unweighted event generation: almost done! (Same LHE output files, except for missing color/helicity)

Performance results ⇒ Total time = Madevent time (scalar, sequential) + ME time (vector, parallel)

- The overall speedup is limited by the incompressible scalar component (we need to reduce that too!)

-<u>Amdahl's law</u>: if parallel fraction is initially p, maximum speedup is 1/(1-p)

# AVX512 on Intel Silver: x4.4 speedup for MEs, x3.9 for full workflow AVX512 on Intel Gold: x7.8 speedup for MEs, x6.4 for full workflow

CERN: Intel Silver 4216 + Nvidia V100

| $gg \rightarrow ttggg$ | [seconds] Overall = MadEvent + MEs [MEs/second]                                |                                                      |                                                                                                   |                                                              |                                                                           |  |  |
|------------------------|--------------------------------------------------------------------------------|------------------------------------------------------|---------------------------------------------------------------------------------------------------|--------------------------------------------------------------|---------------------------------------------------------------------------|--|--|
| 6k events              | FORTRAN<br>CPP/none<br>CPP/sse4<br>CPP/avx2<br>CPP/512y<br>CPP/512z<br>CUDA/32 | $\begin{array}{rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr$ | $\begin{array}{r} 4.16 + \\ 4.89 + \\ 4.50 + \\ 4.26 + \\ 4.22 + \\ 4.34 + \\ 5.34 + \end{array}$ | 89.49<br>106.62<br>57.66<br>29.52<br>26.44<br>24.02<br>58.38 | <pre>7.19e+01 6.03e+01 1 12e+02 2.16e+02 2.43e+02 2.68e+02 1.10e+02</pre> |  |  |
| 800k events            | CUDA/8192                                                                      | 639.20 =                                             | 527.37 +                                                                                          | 111.83                                                       | 7.40e+03                                                                  |  |  |

Juwels: Intel Gold 6148

| [seconds] | Overall = M | ladEvent + | MEs [l | MEs/second] |
|-----------|-------------|------------|--------|-------------|
| FORTRAN   | 68.93 =     | 2.84 +     | 66.09  | 9.73e+01    |
| CPP/none  | 84.01 =     | 3.38 +     | 80.63  | 7.98e+01    |
| CPP/sse4  | 46.29 =     | 3.04 +     | 43.25  | 1.49e+02    |
| CPP/avx2  | 22.26 =     | 2.85 +     | 19.41  | 3.31e+02    |
| CPP/512y  |             | 2.89 +     | 17.60  | 3.66e+02    |
| CPP/512z  | 13.11 =     | 2.81 +     | 10.30  | 6.24e+02    |
|           |             |            |        |             |

GPU: ~x120 speedup for MEs, only ~x20 for full workflow [Amdahl:  $p = 0.95 \Rightarrow max \ speedup = 20$ ]

(ME speedup would be ~x300 with 16k+ events per GPU grid, but Madevent CPU memory is limited to ~8k per grid)

5.

ICHEP, Bologna, 8 July 2022



A. Valassi – Madgraph on GPUs and vector CPUs: towards production

#### See <a href="https://github.com/madgraph5/madgraph4gpu/raw/c2d22aebd45348e56a0477f0c85bab3d8eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348e56a0477f0c85bab3d8eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348e56a0477f0c85bab3d8eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348e56a0477f0c85bab3d8eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348e56a0477f0c85bab3d8eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348e56a0477f0c85bab3d8eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348e56a0477f0c85bab3d8eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348e56a0477f0c85bab3d8eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348e56a0477f0c85bab3d8eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348e56a0477f0c85bab3d8eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348e56a0477f0c85bab3d8eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_OUTPUT/madgraph4gpu/raw/c2d22aebd45348eab18ab/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP\_SA\_

Copyright (C) 2020-2023 CERN and UCLouvain. Licensed under the GNU Lesser General Public License (version 3 or later). All rights not expressly granted are reserved.

The copyright and license notice above cover the CUDACPP code-generating plugin of the MadGraph5\_aMC@NLO (in the following "MG5aMC") software, and all code generated using that plugin. These are collectively referred to as "this work" or "the MG5aMC CUDACPP plugin and the code that it generates", or more simply as "the MG5aMC CUDACPP plugin", in the following and throughout this work.

The MG5aMC CUDACPP plugin and the code that it generates are based on the initial work on porting MG5aMC to GPUs using CUDA and on speeding up MG5aMC on CPUs using vectorized C++ by three original authors from CERN and UCLouvain. The full development team currently includes the following authors :

Stephan Hageboeck (CERN)

Olivier Mattelaer (Universite Catholique de Louvain, original author) Stefan Roiser (CERN, original author)

Andrea Valassi (CERN, original author)

Zenny Wettersten (CERN)

See https://github.com/madgraph5/madgraph4gpu for more details. For the full list of authors and collaborators of this work, see the file "AUTHORS" in the same directory as this "COPYRIGHT" file in the source code of the plugin.

The MG5aMC CUDACPP plugin and the code that it generates are derived from, and are intended to be used in combination with, the MG5aMC software and the code that it generates. The MG5aMC software is developed by the MadGraph5\_aMC@NLO development team and contributors, also known as the "MadTeam", who are the owners of its copyright and have licensed it as specified in https://qithub.com/mg5amcnlo/blob/main/madgraph/LICENSE.

The MG5aMC CUDACPP plugin and the code that it generates are free software; you can redistribute them and/or modify them under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 or (at your option) any later version.

This work is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

The GNU Lesser General Public License (LGPL) version 3 is copied verbatim in the file "COPYING.LESSER" in the same directory as this "COPYRIGHT" file. It is also available at <https://www.gnu.org/licenses/lgpl-3.0.txt>.

This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License (GPL), which is copied verbatim in the file "COPYING" in the same directory as this "COPYRIGHT" file and is also available at <https://www.gnu.org/licenses/gpl-3.0.txt>.

In line with the license above, the authors emphasise the following points. For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions.

## Proposed COPYRIGHT file

This would be distributed with the code-generating plugin, and also in generated code

AV – cudacpp licensing

25 April 2023

8



A. Valassi – Madgraph on GPUs and vector CPUs: towards production



# A tale of two repositories (now)

#### mg5amcnlo

#### https://github.com/mg5amcnlo/mg5amcnlo

- the MG5AMC repo (previously launchpad)
- python framework, fortran codegen
- permissive NCSA-style license

A specific commit is in madgraph4gpu —

Important branches for GPU/SIMD work:

- **gpucpp** (the baseline: merge here!)
- gpucpp\_june24 (channelid array) мексер
- gpucpp\_goodhel (new helicity filter) wip
- gpucpp\_for360 (complete 3.6.0 sync) wip

#### madgraph4gpu

https://github.com/madgraph5/madgraph4gpu

- cudacpp plugin (cuda/c++ codegen)
- generated code, tests, results (+legacy stuff)
- more restrictive LGPL license
- Includes mg5amcnlo as a git submodule

Important branches for GPU/SIMD work:

- master (the baseline: merge here!)
- master\_june24 (...) мексер
- master\_goodhel (...) wip
- master\_for360 (...) WIP

Status: finally merged "june24" this week; now fixing the conflicts with "goodhel" and "for360" Aim for a v3.6.0 release including the GPU/SIMD support... possibly by end September!?



A. Valassi – status of MG5AMC (LO) on GPU and SIMD

6 September 2024 3/8



A. Valassi – Madgraph on GPUs and vector CPUs: towards production



# Still a tale of two repositories (later)?

#### Option 1 – our assumption so far

#### mg5amcnlo

https://github.com/mg5amcnlo/mg5amcnlo

• *the* MG5AMC repo (NCSA-style) *Includes cudacpp as a git submodule* A specific commit is in mg5amcnlo in PLUGIN/CUDACPP OUTPUT

#### mg5amcnlo cudacpp (OLD WIP AUG 2023)

https://github.com/mg5amcnlo/mg5amcnlo cudacpp

• cudacpp plugin (LGPL)

#### Option 2? – recent discussion AV/OM

mg5amcnlo https://github.com/mg5amcnlo/mg5amcnlo Includes cudacpp as a subdirectory in PLUGIN/CUDACPP OUTPUT

Advantages/Disadvantages?

- Option 1 gives cleaner separation; but merge conflicts with git submodules are hard

- Option 2 is easier to manage, but more monolithic; following up if licensing is ok



A. Valassi - status of MG5AMC (LO) on GPU and SIMD

6 September 2024 4/8



A. Valassi – Madgraph on GPUs and vector CPUs: towards production



## Packaging – two weeks ago now



5 A.

A. Valassi - Madgraph on GPUs and vector CPUs: towards production

