



# Developments in software performance and portability for Madgraph5\_aMC@NLO

Taylor Childers Walter Hopkins Nathan Nichols



Laurence Field Stephan Hageboeck Stefan Roiser David Smith Andrea Valassi



**Olivier Mattelaer** 



ICHEP, Bologna, 8<sup>th</sup> July 2022 https://agenda.infn.it/event/28874/contributions/169193

#### Outline

- Introduction
  - Monte Carlo generators in WLCG computing
  - Madgraph5\_aMC@NLO (MG5aMC) and the madgraph4gpu project
  - Monte Carlo matrix element generators and data parallelism
- Results and outlook in three main areas of development
  - (1) ME calculation in the 'cudacpp' implementation (C++ with vectorization on CPU, CUDA on Nvidia GPUs)
  - (2) ME calculation in C++ portability frameworks (Alpaka, Kokkos, Sycl on CPUs and on Nvidia/AMD/Intel GPUs)
  - (3) Integration of C++ based ME calculations into the Madevent Fortran framework
- Conclusions





# Motivation: Monte Carlo Event Generators in WLCG computing

- LHC computing needs are predicted to outpace resource growth on HL-LHC timescales
  - Need aggressive R&D to improve software efficiency and port it to new architectures and resources
  - GPUs increasingly important, in site clusters but also HPC centres (already used opportunistically in WLCG)
  - Performance portability frameworks enable use of new systems without writing multiple software versions







The HSF Physics Event Generator WG (Andrea Valassi, Efe Yazgan, Josh McFayden et al.), "Challenges in Monte Carlo Event Generator Software for High-Luminosity LHC", Computing and Software for Big Science (2021) 5:12, https://doi.org/10.1007/s41781-021-00055-1 (received 18 May 2020, accepted 2 March 2021, published online 22 May 2021)

- MC generators, the essential 1<sup>st</sup> step in simulation, use 10-20% of ATLAS/CMS WLCG CPU budget
  - Many ways to speed up their performance: see the HEP Software Foundation (HSF) Generator WG review

- MC generators are ideal candidates to exploit data parallelism in GPUs (SIMT) and in vector CPUs (SIMD)



# Madgraph5\_aMC@NLO (MG5aMC)

- One of the workhorses for event generation in ATLAS and CMS!
  - SM and BSM, LO and NLO, integration with PDF and loop libraries...
  - Matrix Element (ME) calculations, merging of multi-jet final states, NLO matching of MEs and Parton Showers (PS)...





https://doi.org/10.1007/JHEP07(2014)079





- MG5aMC production version in Fortran
  - Software outer shell: Madevent
    - A Fortran/Python/bash framework for phase space random sampling, integration and unweighted event generation
  - Software inner core: ME calculation code, automatically generated for each physics process
    - Production version in Fortran (but simpler, non-optimized versions also exist in Python and C++)
    - Matrix Element calculations take 95%+ of the CPU time for complex processes (e.g.  $gg \rightarrow t\bar{t}ggg$ )



# MG5aMC and the madgraph4gpu project

- *madgraph4gpu: speed up ME calculation in MG5aMC* on modern hardware (GPUs and vector CPUs)
  - A collaboration of theoretical/experimental physicists with software engineers, born in the HSF generator WG
  - It would not be possible without Olivier Mattelaer (MG5aMC co-author and current main maintainer)!
- Previous results were presented at vCHEP2021 (May 2021):
  - (1) Only a simple $e^+e^- \rightarrow \mu^+\mu^-$ process, hardcoded one-off CUDA/C++
  - (2) In C++ with vectorization for CPUs, in CUDA only for Nvidia GPUs
  - (3) Only a standalone application (not usable by the experiments)

Andrea Valassi, Stefan Roiser, Olivier Mattelaer, and Stephan Hageboeck (CERN, IT-SC group, Geneva, Switzerland; Université Catholique de Louvain, Belgium), "Design and engineering of a simplified workflow execution for the MG5aMC event generator on GPUs and vector CPUs", EPJ Web of Conferences 251, 03045 (2021), CHEP 2021, https://doi.org/10.1051/epjconf/202125103045

- Two main goals for our current efforts in 2022
  - Release MG5aMC for LO (no NLO yet!) event generation in ATLAS/CMS (CPU SIMD speedups and GPU port)
  - Gain experience for the HEP software community on the usefulness of portability frameworks (PFs)
- Main new progress since May 2021:
  - (1) Code generation plugins instead of one-off code: performance results for complex  $gg \rightarrow t\bar{t}ggg$  processes
  - (2) Additional implementations with PFs (Alpaka, Kokkos, Sycl), e.g. also for AMD and Intel GPUs
  - (3) Integration of CUDA/C++ ME calculation into Madevent: cross sections done, event generation almost done




# MG5aMC computational anatomy and data parallelism strategy

In MC generators, <u>the same function is used to compute the Matrix Element for many different events</u>

- ANY matrix element generator is a good fit for lockstep processing on GPUs (SIMT) and vector CPUs (SIMD)
- Data parallelism strategy in madgraph4gpu is event-level parallelism (many events = many phase space points)
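The idea of event-level parallelism can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `toyME` is a hypothetical stand-in for a generated matrix element, and `computeMEs` is a hypothetical driver; neither is taken from the madgraph4gpu code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a matrix element: a branch-free function of one event's inputs.
// (Hypothetical: real MEs are generated code taking the particle 4-momenta.)
inline double toyME(double x, double y)
{
  return x * x + 2.0 * x * y + y * y;
}

// Event-level parallelism: the same function is evaluated for every event.
// Inputs are stored as one contiguous array per variable, so iteration ievt
// is independent of all others and the loop body contains no branching.
std::vector<double> computeMEs(const std::vector<double>& xs, const std::vector<double>& ys)
{
  assert(xs.size() == ys.size());
  std::vector<double> mes(xs.size());
  for (std::size_t ievt = 0; ievt < xs.size(); ++ievt)
    mes[ievt] = toyME(xs[ievt], ys[ievt]);
  return mes;
}
```

Because every iteration executes the same instructions on different data, the loop index can be mapped onto SIMD lanes on a CPU or onto threads on a GPU.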



Software performance and portability in Madgraph5\_aMC@NLO



# Aside – Monte Carlo's: what about branching?

- Monte Carlo methods are based on drawing (pseudo-)random numbers: a dice throw
- From a software workflow point of view, these are used in *two rather different cases*:






# Code generation: from many "epochs" to a single evolving "epoch"




# Matrix Element (ME) calculation in cudacpp: results

(1) First line of development: the "cudacpp" plugin to calculate MEs in C++ (CPUs) or CUDA (GPUs)

- Single code base for C++ and CUDA (with #ifdef's): original development, currently the most advanced
- Exploit SIMD vectorization through explicit Compiler Vector Extensions (gcc, clang, icpx)
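As an illustration of what explicit vectorization with compiler vector extensions looks like, here is a minimal sketch using the gcc/clang `vector_size` attribute. It assumes a toy stand-in for the matrix element; `double_v`, `neppV`, `toyMEv` and `computeMEsSimd` are hypothetical names chosen for this example, not the cudacpp API.

```cpp
#include <cassert>

// Compiler vector extension (gcc/clang syntax): four doubles in one 256-bit vector.
typedef double double_v __attribute__((vector_size(32)));

constexpr int neppV = 4; // events per vector (hypothetical name)

// Arithmetic operators on double_v act on all four lanes at once.
// Toy stand-in for a matrix element, not a real physics calculation.
inline void toyMEv(const double_v& x, const double_v& y, double_v& me)
{
  me = x * x + y * y;
}

// Process nevt events (assumed to be a multiple of neppV) in chunks of neppV.
void computeMEsSimd(const double* xs, const double* ys, double* mes, int nevt)
{
  assert(nevt % neppV == 0);
  for (int ievt = 0; ievt < nevt; ievt += neppV)
  {
    double_v x, y, me;
    for (int i = 0; i < neppV; ++i) // load one event per lane
    {
      x[i] = xs[ievt + i];
      y[i] = ys[ievt + i];
    }
    toyMEv(x, y, me);
    for (int i = 0; i < neppV; ++i) mes[ievt + i] = me[i]; // store the four results
  }
}
```

The vector type makes the data parallelism explicit in the source, instead of relying on the compiler to auto-vectorize a scalar loop.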

Intel Silver 4216 CPU (CERN), one single thread. Poor AVX512/zmm results ⇔ One FMA unit?

| Implementation ($gg \rightarrow t\bar{t}gg$) | MEs/second (double) | MEs/second (float) |
|---|---|---|
| 1-core MadEvent Fortran, scalar | 3.96E3 (x2.2) | |
| 1-core Standalone C++, scalar | 1.84E3 (x1.00) | 1.80E3 (x0.98) |
| 1-core Standalone C++, 128-bit SSE4.2 (x2 doubles, x4 floats) | 3.36E3 (x1.8) | 6.60E3 (x3.6) |
| 1-core Standalone C++, 256-bit AVX2 (x4 doubles, x8 floats) | 6.86E3 (x3.7) | 1.31E4 (x7.1) |
| 1-core Standalone C++, "256-bit" AVX512 (x4 doubles, x8 floats) | 7.68E3 (x4.2) | 1.41E4 (x7.7) |
| 1-core Standalone C++, 512-bit AVX512 (x8 doubles, x16 floats) | 6.52E3 (x3.5) | 1.32E4 (x7.2) |
| Standalone CUDA, NVidia V100S-PCIE-32GB (TFlops*: 7.1 FP64, 14.1 FP32) | 4.89E5 (x270) | 9.27E5 (x500) |

Fortran row: helicity recycling (different/faster algorithm)

Intel Gold 6148 CPU (Ju…). Better AVX512/zmm results

| Implementation ($gg \rightarrow t\bar{t}gg$) | MEs/second (double) | MEs/second (float) |
|---|---|---|
| 1-core Standalone C++, scalar | 2.39E3 (x1.00) | 2.50E3 (x1.05) |
| 1-core Standalone C++, 128-bit SSE4.2 (x2 doubles, x4 floats) | 4.59E3 (x1.9) | 9.42E3 (x3.6) |
| 1-core Standalone C++, 256-bit AVX2 (x4 doubles, x8 floats) | 1.06E4 (x4.4) | 2.15E4 (x9.0) |
| 1-core Standalone C++, "256-bit" AVX512 (x4 doubles, x8 floats) | 1.15E4 (x4.8) | 2.28E4 (x9.5) |
| 1-core Standalone C++, 512-bit AVX512 (x8 doubles, x16 floats) | 1.96E4 (x8.2) | 4.03E4 (x16.9) |

Main new results since vCHEP2021:

- Backport to code generation: speedups previously reported for ee\_mumu now ~confirmed for gg\_ttgg
  - CPUs/SIMD: x4 double, x8 float
  - GPUs/V100: x270 double, x500 float (could do better, high register pressure)
- Extra x2 from AVX512 on Intel Gold: achieves the theoretical limit of speedup, x8 double and x16 float on an AVX512 CPU
- New features for MadEvent integration


# Portability Frameworks (PFs)

(2) Second line of development: MEs on PFs

- PFs allow writing algorithms once and running on many architectures with some hardware-specific optimizations
- CUDA code can only run on NVidia GPUs, while <u>Kokkos</u>, <u>Alpaka</u>, and <u>Sycl[Intel]</u> codes can run on most hardware
- In "cudacpp", #ifdef directives separate code branches for GPU and CPU code during compilation (but these are very few: only kernel launching and memory access, not MEs)
- With PFs, the algorithm is typically the same, but the compilation occurs once per architecture type
- PFs often use templating to handle data types and hardware configuration, and function lambdas or pointers for passing kernels (the cudacpp plugin has many of these, too)
- PFs still require the user to think about "host" vs "device"
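The single-source approach with #ifdef directives described above can be sketched as follows. This is an illustrative example, not the cudacpp code: `ME_HOST_DEVICE`, `toyME`, `toyKernel` and `runMEs` are hypothetical names, the "matrix element" is a toy function, and CUDA error checking is omitted for brevity. The physics function is shared, and only kernel launching and memory handling differ between the two branches.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Under nvcc (__CUDACC__ defined) the ME function is also a device function;
// under a plain C++ compiler the qualifier expands to nothing.
#ifdef __CUDACC__
#define ME_HOST_DEVICE __host__ __device__
#else
#define ME_HOST_DEVICE
#endif

// Shared code: identical source for GPU and CPU (toy stand-in for a ME).
ME_HOST_DEVICE inline double toyME(double x, double y)
{
  return x * x + y * y;
}

#ifdef __CUDACC__
// GPU branch: one thread per event.
__global__ void toyKernel(const double* xs, const double* ys, double* mes, int nevt)
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  if (ievt < nevt) mes[ievt] = toyME(xs[ievt], ys[ievt]);
}
#else
// CPU branch: a loop over events.
void toyKernel(const double* xs, const double* ys, double* mes, int nevt)
{
  for (int ievt = 0; ievt < nevt; ++ievt) mes[ievt] = toyME(xs[ievt], ys[ievt]);
}
#endif

// Host driver: only the launch and the memory handling differ per branch.
std::vector<double> runMEs(const std::vector<double>& xs, const std::vector<double>& ys)
{
  assert(!xs.empty() && xs.size() == ys.size());
  const int nevt = static_cast<int>(xs.size());
  std::vector<double> mes(xs.size());
#ifdef __CUDACC__
  const std::size_t bytes = xs.size() * sizeof(double);
  double *dxs, *dys, *dmes;
  cudaMalloc(&dxs, bytes);
  cudaMalloc(&dys, bytes);
  cudaMalloc(&dmes, bytes);
  cudaMemcpy(dxs, xs.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dys, ys.data(), bytes, cudaMemcpyHostToDevice);
  const int nthreads = 256;
  const int nblocks = (nevt + nthreads - 1) / nthreads;
  toyKernel<<<nblocks, nthreads>>>(dxs, dys, dmes, nevt);
  cudaMemcpy(mes.data(), dmes, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dxs);
  cudaFree(dys);
  cudaFree(dmes);
#else
  toyKernel(xs.data(), ys.data(), mes.data(), nevt);
#endif
  return mes;
}
```

Compiled with a C++ compiler, only the CPU branch is seen; compiled with nvcc as a `.cu` file, the same source produces the GPU version.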


#### "cudacpp" example of compiler directives

[Code screenshot, only partly recoverable: a kernel launch `…(float)>>>(devMome…` with arguments `…et(), devMEs.get()` in the branch for GPU, and a separate branch for CPU]

#### Kokkos example of Templating & lambda

    {
      using member_type = typename Kokkos::TeamPolicy<Kokkos::DefaultExecut…
      Kokkos::TeamPolicy<Kokkos::DefaultExecutionSpace> policy( league_size…
      Kokkos::parallel_for(func,policy,
      KOKKOS_LAMBDA(member_type team_member){

#### Kokkos example of Memory Management

    Kokkos::View<fptype***,Kokkos::DefaultExecutionSpace> devMomenta(Kokkos::ViewAllocateWithoutInitializing("devMomenta"),nevt,npar,np4);
    auto hstMomenta = Kokkos::create_mirror_view(devMomenta);




# ME calculation in PFs: GPU results (Nvidia A100)

Throughput scaling (threads, blocks) for a simple  $e^+e^- \rightarrow \mu^+\mu^-$  process and a complex  $gg \rightarrow t\bar{t}gg$  process (note: this is an older version of the code with respect to the results shown earlier for cudacpp alone)



- This and the next slide show both ee\_mumu and gg\_ttgg for comparison, but *please focus only on the gg\_ttgg results!*
  - The ME calculations in ee\_mumu are extremely simple: the overhead of CPU-GPU memory copies on total MEs/s is huge (and was maybe handled differently in the 4 implementations?)
- Good news 1: for gg\_ttgg, all four implementations look similar!
  - The benefit of direct CUDA over a PF is limited, if any at all

En passant, keep this in mind for later: you need at least 16k "events per GPU grid" to fill up a V100 or A100 with gg\_ttgg

- Simpler processes need even more, e.g. 500k for ee\_mumu




# ME calculation in PFs: GPU results (Nvidia, Intel, AMD)

Maximum throughput for a simple  $e^+e^- \rightarrow \mu^+\mu^-$  process and a complex  $gg \rightarrow t\bar{t}gg$  process

(note: this is an older version of the code with respect to the results shown earlier for cudacpp alone)



- Again, please focus only on the gg\_ttgg results!
- Good news 1: for gg\_ttgg, all four look similar on Nvidia!
  - The benefit of direct CUDA over a PF is limited, if any at all
- Good news 2: PFs also work on AMD and Intel GPUs!
  - Out of the box, with a single implementation

(There is no Alpaka on Intel in the plots because we use Cupla: we should move to using native Alpaka)

Xe-HP is a software development vehicle for functional testing only. It is currently used at Argonne and at other customer sites to prepare their code for future Intel data centre GPUs


## ME calculation in PFs: CPU results (preliminary! need systematic study)

Maximum throughput for five processes, from simple ($e^+e^- \rightarrow \mu^+\mu^-$) to more complex ($gg \rightarrow t\bar{t}ggg$)

(note: this is an older version of the code with respect to the results shown earlier for cudacpp alone)

- CPUs have two very different parallelisms we can exploit:
  - Many floats/doubles per vector register: vectorization (SIMD)
  - Many physical/virtual cores: multi-threading (or many processes!)

- *NB: this plot is comparing apples to oranges and to peaches!*
  - Fortran: one single thread, no vectorization
  - Kokkos: internal multithreading? limited auto-vectorization?
  - SYCL: internal multithreading? limited auto-vectorization?
  - cudacpp: OpenMP multithreading, *explicit vectorization (CVE)*
    - The OMP multithreading in the cudacpp plugin is known to be suboptimal and will be reengineered (probably with std::thread instead)

On CPUs, for the moment, it seems better to use ad-hoc developments as in cudacpp, than rely on PFs (NB: you may replace OMP by many applications in parallel, but you must do low-level coding to get a factor x4 or more from SIMD)



### Matrix Element (ME) calculation in cudacpp and PFs: outlook

Short term (end 2022?)

- (Nvidia GPUs) Further improve CUDA performance with smaller kernels
  - Exploit tensor cores for color algebra in cudacpp? Would tensor cores be supported by PFs?
  - Finer grained strategy for distributing work on the GPU(s)? Multi-GPU support?
- (AMD/Intel GPUs) Add direct HIP to the cudacpp implementation and, in parallel, advance the PF implementations
- (CPUs, multithreading) Replace OpenMP by std::thread; systematic thread scaling studies in cudacpp and PFs
  - Containerize the standalone application and collaborate on scaling studies with the HEPiX benchmarking WG
- (CPUs, vectorization) Systematic vectorization studies in PF implementations
- (CPUs, GPUs) Numerical precision studies: stress tests of -O3 and fast math (our default assumption...)

Medium term (2023+)

- (CPUs, GPUs) Implement helicity recycling in cudacpp (additional x2-3 algorithmic speedup, now only in Fortran)
- (CPUs, GPUs) Handle NLO: loops and matching to PS
- (CPUs, GPUs) Numerical precision studies: would float be enough? (additional x2 speedup over double)



# Matrix element integration in MadEvent: overview

(3) Third line of development: replacing Fortran by cudacpp MEs in Madevent (keep the user interface!)

Linking Fortran and C++ has been easy. As expected, the two main issues have been, instead:

1. Moving Madevent from single-event to many-event processing (need 16k+ events per GPU grid $\Rightarrow$ huge arrays in CPU memory!)
2. Debugging the issues caused by hidden inputs and outputs, largely coming from Fortran common blocks






# Matrix element integration in MadEvent: results

- Functional results (Madevent with Fortran MEs vs CUDA/C++ MEs, using the same random seeds)
  - Cross section calculation: done! (Same cross section within ~E-14 relative accuracy)
  - Unweighted event generation: almost done! (Same LHE output files, except for missing color/helicity)
- Performance results $\Rightarrow$ Total time = Madevent time (scalar, sequential) + ME time (vector, parallel)
  - The overall speedup is limited by the incompressible scalar component (we need to reduce that too!)
  - <u>Amdahl's law</u>: if the parallel fraction is initially $p$, the maximum speedup is $1/(1-p)$
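Amdahl's law can be written as a short worked example. The sketch below uses the standard formula: if a fraction $p$ of the original run time is accelerated by a factor $s$, the overall speedup is $1/((1-p) + p/s)$, which tends to $1/(1-p)$ for large $s$. The function names are hypothetical.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction p of the original run time
// is accelerated by a factor s and the remaining (1 - p) is unchanged.
inline double amdahlSpeedup(double p, double s)
{
  assert(p >= 0.0 && p <= 1.0 && s > 0.0);
  return 1.0 / ((1.0 - p) + p / s);
}

// Upper bound reached for s -> infinity: 1 / (1 - p).
inline double amdahlMaxSpeedup(double p)
{
  assert(p >= 0.0 && p < 1.0);
  return 1.0 / (1.0 - p);
}
```

For example, with $p = 0.95$ the upper bound is 20, and a x8 speedup of the parallel part alone gives an overall speedup of about 5.9.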

- AVX512 on Intel Silver: x4.4 speedup for MEs, x3.9 for full workflow
- AVX512 on Intel Gold: x7.8 speedup for MEs, x6.4 for full workflow

CERN: Intel Silver 4216 + Nvidia V100

| $gg \rightarrow t\bar{t}ggg$ | Implementation | Overall [s] | MadEvent [s] | MEs [s] | Throughput [MEs/second] |
|---|---|---|---|---|---|
| 6k events | FORTRAN | | 4.16 | 89.49 | 7.19e+01 |
| 6k events | CPP/none | | 4.89 | 106.62 | 6.03e+01 |
| 6k events | CPP/sse4 | | 4.50 | 57.66 | 1.12e+02 |
| 6k events | CPP/avx2 | | 4.26 | 29.52 | 2.16e+02 |
| 6k events | CPP/512y | | 4.22 | 26.44 | 2.43e+02 |
| 6k events | CPP/512z | | 4.34 | 24.02 | 2.68e+02 |
| 6k events | CUDA/32 | 63.72 | 5.34 | 58.38 | 1.10e+02 |
| 800k events | CUDA/8192 | 639.20 | 527.37 | 111.83 | 7.40e+03 |

Juwels: Intel Gold 6148

| Implementation | Overall [s] | MadEvent [s] | MEs [s] | Throughput [MEs/second] |
|---|---|---|---|---|
| FORTRAN | 68.93 | 2.84 | 66.09 | 9.73e+01 |
| CPP/none | | 3.38 | 80.63 | 7.98e+01 |
| CPP/sse4 | | 3.04 | 43.25 | 1.49e+02 |
| CPP/avx2 | | 2.85 | 19.41 | 3.31e+02 |
| CPP/512y | | 2.89 | 17.60 | 3.66e+02 |
| CPP/512z | 13.11 | 2.81 | 10.30 | 6.24e+02 |

GPU: ~x120 speedup for MEs, only ~x20 for full workflow [Amdahl: $p = 0.95 \Rightarrow$ max speedup $= 20$]

(ME speedup would be ~x300 with 16k+ events per GPU grid, but Madevent CPU memory is limited to ~8k per grid)



#### Matrix element integration in MadEvent: outlook

#### Very short term (Q3 2022 – alpha release for the experiments)

- Implement event-by-event random choice of colors and helicities in cudacpp (goal: same LHE files!)
- Cross-check the few last details (pdfs, user parameters...)
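One common way to choose a helicity (or color) event by event is to sample it with probability proportional to its contribution to the summed squared amplitude. The sketch below assumes that approach; it is not taken from the cudacpp or Fortran code, and `selectHelicity` and its arguments are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pick index i with probability contrib[i] / sum(contrib), given a uniform
// random number r in [0, 1). The same r and the same contributions always
// yield the same choice, so two implementations fed the same random numbers
// can make the same selection.
std::size_t selectHelicity(const std::vector<double>& contrib, double r)
{
  assert(!contrib.empty() && r >= 0.0 && r < 1.0);
  double total = 0.0;
  for (double c : contrib) total += c;
  assert(total > 0.0);
  const double target = r * total;
  double cumulative = 0.0;
  for (std::size_t i = 0; i < contrib.size(); ++i)
  {
    cumulative += contrib[i];
    if (target < cumulative) return i;
  }
  return contrib.size() - 1; // guard against floating-point rounding
}
```

A helicity with zero contribution is never selected, since the cumulative sum does not advance past the target at that index.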

Short to medium term (end 2022 – 2023)

- Reduce overhead from scalar Madevent framework (goal: overall speedups closer to ME speedups)
  - This is currently the bottleneck preventing higher throughputs for the overall workflow using GPUs
  - One possible option: heterogeneous workflow (multithreaded Madevent on CPU, parallel ME on GPU)?
- Reduce number of Fortran arrays in Madevent (goal: lower CPU memory, allow larger GPU grids beyond 8k)



#### Conclusions

- ALL Matrix Element Generators are perfect fits to exploit CPU vectorization/SIMD and GPUs
  - Lockstep parallelism in MEGs is much easier to exploit than in detector simulation (Geant4, stochastic branching)
- An alpha release of MG5aMC for LO with GPU ports and CPU speedups from SIMD is imminent
  - Cross section calculation is ready; a few details to fix for unweighted event generation (random color/helicity...)
- On Intel Gold CPUs, AVX512 C++ is x8 faster than scalar C++ for ME calculations (in double precision)
  - A slightly lower speedup ~x6 holds for the full Madevent + ME workflow (Amdahl's law, as Madevent is scalar)
  - Overall speedup ~x5 compared to Fortran (comparing to the old Fortran release without helicity recycling)
  - An additional x2-3 algorithmic speedup will come through helicity recycling (not yet in cudacpp)
- On GPUs, much larger O(300+) speedups may be achieved for the ME calculation
  - But we must reduce the scalar component in Fortran MadEvent to see those in the full workflow (Amdahl's law)
- Additional x2 speedups may be achieved on CPUs and GPUs by moving from double to single precision
- Portability Frameworks work well for us! They simplify development with a single code for many GPU flavors
  - Similar performance to direct CUDA on Nvidia GPUs; we may also run out of the box on AMD and Intel GPUs




# BACKUP SLIDES


#### Acknowledgements

- We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
- We gratefully acknowledge the use (under PRACE proposal PRACE-DEV-2022D01-022) of the JUWELS supercomputer and other computing resources provided and operated by the Jülich Supercomputing Centre at Forschungszentrum Jülich.

#### CPU throughput plots – SIMD + multi-core

- Two different throughput speedup factors multiply each other: SIMD and multi-core
  - SIMD: fewer instructions per processor (e.g. in AVX2 each instruction applies to 4 doubles)
  - Multi-core: many cores used in parallel (e.g. multiple jobs, multi-threading, multi-processing)





[Throughput plots not recoverable; surviving legend: "#threads per MT job", with entries including 1 (ST), 16 and 32]

#### **CUDA: Profiling with NVidia NSight Compute – ncu**

- We regularly profile CUDA with ncu [both one-off studies and on-commit checks]
  - Thanks to our mentors at the Sheffield GPU hackathon for getting us started!
- We see *no evidence of thread divergence* [branch efficiency is 100%]
- Our AOSOA layout ensures coalesced memory access [requests vs transactions]
- We continuously *monitor register pressure*: decreasing it is one of our future goals
  - We plan to split the ME computation into many kernels coordinated by CUDA Graphs
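To make the AOSOA ("array of structures of arrays") idea concrete, here is a minimal indexing sketch. The layout `buffer[ipag][ipar][ip4][iepp]`, the name `aosoaIndex` and the parameter names are assumptions made for this illustration, not a description of the actual cudacpp memory layout.

```cpp
#include <cassert>
#include <cstddef>

// Flat index of momentum component ip4 of particle ipar in event ievt.
// Events are grouped in pages of neppM events; within a page, the same
// component of consecutive events is stored contiguously.
// Assumed layout (hypothetical): buffer[ipag][ipar][ip4][iepp].
inline std::size_t aosoaIndex(std::size_t ievt, std::size_t ipar, std::size_t ip4,
                              std::size_t npar, std::size_t np4, std::size_t neppM)
{
  assert(ipar < npar && ip4 < np4 && neppM > 0);
  const std::size_t ipag = ievt / neppM; // page containing the event
  const std::size_t iepp = ievt % neppM; // position of the event within its page
  return ((ipag * npar + ipar) * np4 + ip4) * neppM + iepp;
}
```

With this ordering, neighbouring GPU threads (one event each) that read the same momentum component access neighbouring memory addresses, which is the access pattern that coalescing rewards.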






# EVEN MORE BACKUP SLIDES




# Argonne's Joint Laboratory for System Evaluation (JLSE)

We used JLSE systems to run all performance tests described for Alpaka/Kokkos/Sycl vs Cuda/OpenMP

- NVidia A100 Nodes: AMD 7532 32c 2.4GHz; DDR4-3200 256GB (8x32G DIMMs) RAM; 1x Nvidia A100 40GB PCIe 4.0; Mellanox ConnectX-6 EDR
- NVidia V100 Nodes: 4x NVIDIA Tesla V100 SXM2 w/32GB HBM2; 2x Intel Xeon Gold 6152 CPU 22c 2.10GHz; 192GB RAM DDR4-2666; Mellanox ConnectX-5 EDR
- Iris Nodes: Intel Xeon E3-1585 v5 CPU w/ Intel Iris Pro Graphics P580; 4x 16GB DDR4-2666 SODIMMs (operating at DDR4-2133); 1GbE Onboard
- Arcticus Nodes: 2x Intel development GPU card (Codename XeHP_SDV); 2x Intel(R) Xeon Gold 6336Y CPU (48 physical cores total) 2.4GHz; 256GB: 16x 16GB DDR4 @ 3200; Mellanox ConnectX-6: EDR InfiniBand (100 Gbps)
- AMD MI100 Nodes: 2x AMD EPYC 7543 32c (Milan); 4x AMD MI100 32GB GPUs; Infinity Fabric; 512GB DDR4-3200
- AMD MI50 Nodes: Gigabyte G482-Z51; 2x 7742 64c Rome; 4x AMD MI50 32GB GPUs; Infinity Fabric; 256GB DDR-3200 RAM
- Skylake Nodes: Intel S2600WF; 2x Intel Xeon Platinum 8180M CPU @ 2.50GHz; 768GB RAM



# Build environment on JLSE (Sycl)

We used <u>JLSE systems</u> to run all performance tests described here for Alpaka/Kokkos/Sycl

- NVidia A100 Nodes: Intel oneAPI DPC++ (commit b9cb1d1247e2), CUDA 11.6.2
- NVidia V100 Nodes: Intel oneAPI DPC++ (commit b9cb1d1247e2), CUDA 11.6.2
- Iris Nodes: Intel oneAPI DPC++ (NDA)
- Arcticus Nodes: Intel oneAPI DPC++ (NDA)
- Skylake Nodes: Intel oneAPI DPC++ (2021.4.0)
- AMD MI100 Nodes: Intel oneAPI DPC++ (commit b9cb1d1247e2), ROCM 4.5.2
- AMD MI50 Nodes: Intel oneAPI DPC++ (commit b9cb1d1247e2), ROCM 4.5.2



# Build environment on JLSE (Kokkos)

We used <u>JLSE systems</u> to run all performance tests described for Alpaka/Kokkos/Sycl vs Cuda/OpenMP

- NVidia A100 Nodes: Kokkos 3.5.00, CUDA 11.6.2, g++ 9.4.0
- NVidia V100 Nodes: Kokkos 3.5.00, CUDA 11.6.2, g++ 9.4.0
- Iris Nodes: Intel oneAPI DPC++ (NDA), Kokkos (NDA)
- Arcticus Nodes: Intel oneAPI DPC++ (NDA), Kokkos (NDA)
- Skylake Nodes: Intel oneAPI DPC++ (NDA), Kokkos (NDA)
- AMD MI100 Nodes: Kokkos 3.5.00, ROCM 4.5.2
- AMD MI50 Nodes: Kokkos 3.5.00, ROCM 4.5.2

# Build environment on JLSE (Alpaka)

We used JLSE systems to run all performance tests described for Alpaka/Kokkos/Sycl vs Cuda/OpenMP

- NVidia A100 Nodes: Kokkos 3.5.00, CUDA 11.6.2, g++ 9.4.0
- NVidia V100 Nodes: Kokkos 3.5.00, CUDA 11.6.2, g++ 9.4.0
- Iris Nodes: Intel oneAPI DPC++ (NDA), Kokkos (NDA)
- Arcticus Nodes: Intel oneAPI DPC++ (NDA), Kokkos (NDA)
- Skylake Nodes: Intel oneAPI DPC++ (NDA), Kokkos (NDA)
- AMD MI100 Nodes: Kokkos 3.5.00, ROCM 4.5.2
- AMD MI50 Nodes: Kokkos 3.5.00, ROCM 4.5.2



### Build environment on JLSE (Cuda and OpenMP)

We used <u>JLSE systems</u> to run all performance tests described for Alpaka/Kokkos/Sycl vs Cuda/OpenMP

- NVidia A100 Nodes: CUDA 11.6.2, g++ 9.4.0
- NVidia V100 Nodes: CUDA 11.6.2, g++ 9.4.0
- Skylake Nodes: g++ 11.3.0, OMP\_NUM\_THREADS=56



#### Thread and block scaling for a simple $e^+e^- \rightarrow \mu^+\mu^-$ process

(note: these results use an older version of the code than the results shown earlier for cudacpp and complex $gg \rightarrow t\bar{t}ggg$ processes)

[Plots: thread and block scaling on an NVIDIA A100 for $e^+e^- \rightarrow \mu^+\mu^-$]











#### Code is auto-generated $\Rightarrow$ Iterative development process

- User chooses process, *MG5aMC determines Feynman diagrams and generates code*
  - Currently Fortran (default), C++, or Python
  - The more particles in the collision, the more Feynman diagrams and the more lines of code



| Process                         | LOC  | functions | function calls |
|---------------------------------|------|-----------|----------------|
| $e^+e^- \rightarrow \mu^+\mu^-$ | 776  | 8         | 16             |
| $gg \rightarrow t\bar{t}$       | 839  | 10        | 22             |
| $gg \rightarrow t\bar{t}g$      | 1082 | 36        | 106            |
| $gg \rightarrow t\bar{t}gg$     | 1985 | 222       | 786            |

- Goal: modify code-generating code (add CUDA, improve C++ backend)
  - (1) Start simple: bootstrap with $e^+e^- \rightarrow \mu^+\mu^-$ (two diagrams, few lines of C++ code)
  - (2,3) Add CUDA and improve C++, port upstream to Python meta-code
  - (4) Generate more complex LHC processes $gg \rightarrow t\bar{t}, t\bar{t}g, t\bar{t}gg$
  - Add missing functionality, fix issues, improve performance, iterate





[Diagram: iterative development cycle: first C++ code from MadGraph, engineered CUDA/C++ code developed on top (2), integration upstream (3), then start a new "epoch"]

#### A complex outer shell – with a CPU-intensive core: the ME

- To generate unweighted events in MG5aMC: execute a "gridpack"
  - Python and bash scripts launching multiple instances of a Fortran application (madevent)
  - A complex software infrastructure with many functionalities and a stable user interface



- Overall, <u>the ME calculation is the CPU bottleneck</u> (Fortran routine matrix1)
  - Fraction of time spent in ME increases with number of events and process complexity

|          | $gg \rightarrow t\bar{t}$ | $gg \rightarrow t\bar{t}gg$ | $gg \rightarrow t\bar{t}ggg$ |
|----------|---------------------------|-----------------------------|------------------------------|
| madevent | 13G                       | 470G                        | 11T                          |
| matrix1  | 3.1G (23%)                | 450G (96%)                  | 11T (>99%)                   |

Our main focus is the ME calculation: develop new CUDA implementation (and speed up existing C++)



A. Valassi – Reengineering Madgraph5\_aMC@NLO for GPUs and vector CPUs

vCHEP – 19 May 2021




#### Standalone CUDA/C++ application VS. MadEvent integration

- Our main focus: the ME calculation in CUDA/C++ (sigmakin kernel/function)
  - Design approach: single source code for CUDA and C++ (>90% common code + #ifdef's)
- Our workhorse: a simplified CUDA/C++ toy framework to feed events to the ME kernel
  - All 3 main components on the GPU: random (cuRAND), sampling (RAMBO), ME (sigmakin)
  - Fast, same results in GPU/CPU, but not good for production (RAMBO algorithm is inefficient)
  - The results I present in this talk come from this framework
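The single-source design approach can be sketched as follows. This is a toy illustration under stated assumptions, not the actual sigmakin code: the names `HOST_DEVICE`, `toyMatrixElement`, `toyKernel` and `toyHostLoop` are hypothetical, and the "matrix element" is a placeholder formula. The per-event function is shared, and only the way events are dispatched differs between the CUDA and C++ builds.

```cpp
#include <cassert>
#include <cstddef>

// When compiled by nvcc, mark the common code as callable on host and device.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Common code: placeholder for the per-event ME calculation (not real physics).
HOST_DEVICE inline double toyMatrixElement(double energy)
{
  return energy * energy;
}

#ifdef __CUDACC__
// CUDA build: one GPU thread processes one event.
__global__ void toyKernel(const double* energies, double* mes, std::size_t nevt)
{
  const std::size_t ievt = blockDim.x * blockIdx.x + threadIdx.x;
  if (ievt < nevt) mes[ievt] = toyMatrixElement(energies[ievt]);
}
#endif

// C++ build (also usable on the host in a CUDA build): explicit event loop.
inline void toyHostLoop(const double* energies, double* mes, std::size_t nevt)
{
  assert(energies != nullptr && mes != nullptr);
  for (std::size_t ievt = 0; ievt < nevt; ++ievt)
    mes[ievt] = toyMatrixElement(energies[ievt]);
}
```

Keeping the physics in one shared function is what makes it practical to check that the GPU and CPU builds give the same results.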





#### **Event-level parallelism in practice – coding and #events**

- Easier to code for GPU SIMT than for CPU SIMD: CUDA code was faster to prototype
- CUDA (GPU) implementation
  - For SIMT, event loop is "orthogonal": one thread = one event (GPU thread ID ↔ event ID)
  - For SIMT, SOA memory layouts are beneficial (coalesced access), but not strictly essential
- C++ (CPU) implementation
  - For SIMD, event loop must be the innermost loop (e.g. invert helicity and event loops)
  - For SIMD, SOA memory layouts in the computational kernel are essential
- To be efficient, CUDA needs O(10k)-O(1M) events in parallel: much more than C++!
  - CUDA: lockstep within each warp (32 threads) + many warps in parallel to fill the GPU
  - C++: lockstep within a vector register (2-8 doubles) + multi-threading or multi-processing
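The loop ordering needed for SIMD can be illustrated with a toy example. This is a minimal sketch that assumes the compiler auto-vectorizes the inner loop; the real implementation may use explicit vector types instead, and the names `toyAmplitude2` and `sumOverHelicities` are hypothetical. The helicity loop is outermost and the event loop is innermost, running with unit stride over SOA-style arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Placeholder for one helicity's contribution to one event (not real physics).
inline double toyAmplitude2(int ihel, double x)
{
  return (ihel + 1) * x * x;
}

// Sum over helicities for all events: events are the innermost loop, so the
// same operation is applied to contiguous data and can be vectorized.
inline void sumOverHelicities(const std::vector<double>& x, std::vector<double>& me, int nhel)
{
  assert(x.size() == me.size());
  const std::size_t nevt = x.size();
  for (std::size_t ievt = 0; ievt < nevt; ++ievt) me[ievt] = 0.;
  for (int ihel = 0; ihel < nhel; ++ihel)             // outer loop: helicities
    for (std::size_t ievt = 0; ievt < nevt; ++ievt)   // inner loop: events
      me[ievt] += toyAmplitude2(ihel, x[ievt]);
}
```

With the loops the other way round (events outside, helicities inside), each inner iteration would do different work on the same event, which does not map onto vector registers.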





#### CUDA: Host(CPU)-to/from-Device(GPU) data copy has a cost

- In our standalone application (all on GPU): momenta, weights, MEs D-to-H
  - Plots below from Nvidia Nsight Systems: 12 iterations with 524k events in each iteration
- Eventually, MadEvent on CPU + MEs on GPU: momenta H-to-D; MEs D-to-H
- The time cost of data transfers is relatively high in simple processes
  - ME calculation on GPU is fast (e.g. $e^+e^- \rightarrow \mu^+\mu^-$: 0.4ms ME calculation ~ 0.4ms ME copy)
  - Note: our ME throughput numbers are (number of MEs) / (time for ME calculation + ME copy)
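The throughput definition in the note above is simple enough to write down directly. The function name `meThroughput` and its argument names are hypothetical, chosen only for this sketch.

```cpp
#include <cassert>
#include <cstddef>

// ME throughput in MEs per second, counting both the ME calculation time
// and the device-to-host ME copy time (both given in seconds).
inline double meThroughput(std::size_t nME, double tCalcSeconds, double tCopySeconds)
{
  assert(tCalcSeconds >= 0. && tCopySeconds >= 0.);
  assert(tCalcSeconds + tCopySeconds > 0.);
  return static_cast<double>(nME) / (tCalcSeconds + tCopySeconds);
}
```

When the copy takes about as long as the calculation, as in the simple process quoted above, the throughput defined this way is about half of what the calculation alone would give.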





#### **CPU throughput results (2)** Double, C++ – Scalar vs SIMD

- SIMD: excellent speedup from vectorization
  - NB: only measuring the parallel calculation
  - Lower overall speedup (Amdahl's law...)
- Best throughput: AVX512 limited to 256-bit width
  - x3.7 over scalar C++ (vs x4 theoretical maximum)
    - Estimate a x3.3 speedup over scalar Fortran
  - Thanks to Sebastien Ponce for the suggestion!
- Disappointing: AVX512 with 512-bit width
  - Slower than AVX2, why? Slower clock, what else?
  - Can it be improved? x8 theoretical maximum...
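The Amdahl's law remark above can be made concrete with a small calculation. The sketch below is illustrative only: the function name `amdahlSpeedup` is hypothetical, and any inputs passed to it are assumptions rather than measurements from this work.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction p of the runtime
// is accelerated by a factor s and the rest is unchanged.
inline double amdahlSpeedup(double parallelFraction, double parallelSpeedup)
{
  assert(parallelFraction >= 0. && parallelFraction <= 1.);
  assert(parallelSpeedup > 0.);
  return 1. / ((1. - parallelFraction) + parallelFraction / parallelSpeedup);
}
```

As a purely hypothetical example, if 90% of the runtime were vectorized with a x4 speedup, `amdahlSpeedup(0.9, 4.)` would give an overall factor of about 3.1, which is why the overall speedup is lower than the one measured for the parallel calculation alone.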

Number of symbols in the .o file, by instruction set and register type:

| Build type     | SSE4.2 (xmm) | AVX2 (ymm) | AVX512 (ymm) | AVX512 (zmm) |
|----------------|--------------|------------|--------------|--------------|
| Scalar         | 614          | 0          | 0            | 0            |
| SSE4.2         | 3274         | 0          | 0            | 0            |
| AVX2           | 0            | 2746       | 0            | 0            |
| 256-bit AVX512 | 0            | 2572       | 95           | 0            |
| 512-bit AVX512 | 0            | 1127       | 205          | 2045         |








#### Issue #2 **Data-parallel paradigms** (GPUs and vectorization)

Generators lend themselves naturally to exploiting event-level parallelism via data-parallel paradigms\*\*

- SPMD: Single Program Multiple Data (GPU accelerators)
- **SIMD**: Single Instruction Multiple Data (CPU vectorization: AVX...)
- The computationally intensive part, the matrix element  $f(\vec{x}_i)$ , is the same function for all events i (in a given category of events)
- Unlike detector simulation (where if/then branches are frequent and lead to thread divergence on GPUs)



- Faster (cheaper?) than on CPUs
- Exploit GPU-based HPCs



https://doi.org/10.5281/zenodo.4028834



A. Valassi – MC generators challenges and strategy towards HL-LHC

