# Fri 30.07.2021 MG dev meeting
Present: SR, AV, Andy, Josh, Walter, Taylor
## Round table
Andy: thesis defense a few days ago.
Probably last meeting, thanks everybody!
Walter: ntr
Taylor:
- Had a kokkos hackathon organised by the CCE.
Some discussion on benchmarking (ATLAS fastcalosim, CMS patatrack, madgraph) and cross-platform versions.
- Analysing why there are still some differences (kokkos ~20% slower than cuda); the main difference is probably memory definition and indexing.
Noticed that the cuda version has a change by Olivier to reuse one amplitude in memory; tried that, hoping it would improve performance, but it actually made it worse. Quite complex to profile CUDA at the level of individual lines of code. AV: not sure it improves things even in CUDA!
> SR: about 20%, are you sure it's the same algorithm as in CUDA? TC: pretty sure it's apples to apples now!
> SR: would be nice to tag a golden cuda and a golden kokkos version. TC: yes, but first it would be useful to sit with Andrea (when he is back from holidays!) to compare if there are still differences...
> Another puzzling observation is that kokkos seems to take more local memory.
Josh: not much, was also on holiday
- Was at a meeting of the UK body where software was also discussed
(STFC PPTAP Software & Computing WS: https://indico.stfc.ac.uk/event/331).
A lot of discussion on generators (big involvement in the UK, eg Durham),
we made a point this needs investment and the UK could play a role.
> SR: were GPUs discussed? JMF: yes in general software ports to GPUs, not only gen but also sim/rec.
> TC: would be interesting if the theory community had a specific library, a bit like numpy.
AV: agree, this is something that Josh and I often point out, easing modularity and reuse
(but these are higher-level components than numpy, which cuts a bit into the core business of theorists).
AV: interesting to have some practical discussion here while we design our internal APIs (example: may be relevant for Powheg).
Andrea:
- Ongoing useful discussion (github/email) with Steve Lantz from CMS/mkFit about AVX512 vectorization. Documented tests on Cori at NERSC. Repeating some tests on my usual Silver Xeon using gcc10.2, it is better than gcc9.2 on AVX512. Looking at pragma omp simd as a strategy for icc (Steve claims it is better than icx).
> Stefan: have some contacts with ARM HPC in the UK, will follow up
- Completed a few tests on AMD EPYC CPUs.
> Taylor: Kokkos actually gets bad performance on AMD GPU devices.
- Prepared a docker container on madgraph4gpu for the HEPIX benchmarking WG (HEP-SCORE)
- Discussed with Stefan about splitting kernel into small kernels
- Will be on holiday for ~5 weeks
Laurence (on holiday now, via email to Stefan)
- Working with Intel on SYCL, he says they are now getting the same performance as CUDA
> AV: on Nvidia GPUs or Intel GPUs? SR: cross checked, yes on Nvidia hardware
> SR: will invite Laurence and/or an Intel engineer to show this at a meeting
Stefan:
- Read that cuda 11.4 is out. Interesting thing is that ncu shows the register occupancy
> AV: installed, on a suggestion from SH, just do "yum install cuda-11-4"
- Looking at splitting the sigmakin kernel, had a discussion with Andrea yesterday on this
## AOB
Next meeting?
- AV and TC Tuesday 7 Sep 3pm GVA time
- General meeting Tuesday 14 Sep 3pm GVA time
- Workshop on generation Thursday 16 Sep afternoon