Compute Accelerator Forum - HEP-CCE

Europe/Zurich
Virtual (Zoom)

Description

 

To receive announcements and information about this forum, please subscribe to compute-accelerator-forum-announce@cern.ch

 

Zoom Meeting ID: 69560339820
Host: Graeme A Stewart
Alternative hosts: Benjamin Morgan, Maria Girone, Thomas Nik Bazl Fard, Stefan Roiser

Compute Accelerator Forum Live Notes

Portable Parallelization Strategies

  • It is unloved, but I think that OpenCL should be included in the table on slide 9. [G. Amadio]

    • Just as long as we mark it with deep red colours in the table showing user-friendliness, since using it in a project like the ATLAS reconstruction would be a nightmare. [A. Krasznahorkay]

  • On slide 9, about alpaka [A. Bocci]:

    • AMD GPU support is in production

    • [CL] Thanks Andrea!

  • On slide 9, about alpaka [B. Gruber] (a single-source alpaka kernel sketch appears after these notes):

    • We have a prototype for SYCL using DPC++, and we can compile for Intel CPUs, GPUs and FPGAs. We also had a few successes with Xilinx FPGAs. The SYCL backend is currently under review and we expect to merge it before the next release (March 2022).

    • [CL] that's great news!

  • [Pascuzzi] Sl. 9 RE: Python

  • [Wells] Sl. 9 RE: Python

  • [Pascuzzi] Sl. 9 RE: Fortran (w/ OMP offload)

    • https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-cpp-fortran-compiler-openmp/top.html

  • [Pascuzzi] Sl. 9 RE: ARM

    • intel/llvm can compile for ARM, but only on the host (no OpenCL, for example); however, Codeplay’s ComputeAorta may be a solution here, or going through POCL

  • [A. Bocci] on slide 10, about rating the ability to use GPUs as efficiently as native CUDA as not very important: it looks like you are not taking into account the case where the GPUs actually need to be paid for?

    • [C. Leggett] But the idea is that if you just write CUDA (or whatever) code that locks you into a single platform, you have zero performance on any other architecture/system. I don't think that we can afford to ignore all the new HPCs that have AMD and Intel GPUs. For time-critical systems like an online DAQ, the performance metric is much more important. It's not going to be a one-size-fits-all solution.

  • [A. Bocci] on slide 12, regarding the Kokkos backend… well, yes, that’s one of the reasons CMS has chosen alpaka.

  • [J. Apostolakis] General: performance is not the only important metric, but it does matter. Are there testbeds in which we can understand the origin of the performance gap between a portable (e.g. Kokkos) application and CUDA?

    • [M. Lin] We have done some studies on the performance gap in Wire-Cell. Some of the discussion is presented in the vCHEP21 paper.

    • [from discussion] Launch time, additional synchronisation, and the different form of the data structures needed for portability were also mentioned.

    • [Charles L.] (notes by JohnAp) In Patatrack some exploration was done; changing the Kokkos code to make it more similar to the CUDA approach makes it faster on NVIDIA GPUs, but less portable (reduced ability to run on FPGAs or other architectures). Please check how well this has been noted.

    • [C. Leggett]: Patatrack found a number of issues with Kokkos:

      • Support for CUDA streams

      • No equivalent of a caching allocator

      • Inability to interoperate with TBB

      • Overheads from initializing Kokkos data structures

  • [B. Morgan] On the build/integration side, what experiences are there in terms of guidelines for structuring code? For example, does/should the portability code only appear in implementation files, or, if it has to be in headers, how should these be organised? I guess this is really a question about CUDA device linking and the ways of (and issues with) building a project composed of several libraries (as opposed to executables), which might be header-only or binary (including device-side interfaces). (Sorry, I have to run to another meeting, but I will catch up on any responses here, and am also happy to discuss offline!)

    • [C. Leggett] For CUDA and Kokkos, basically all GPU kernels/code need to be completely visible in a single compilation unit if they call each other. Completely independent kernels can be in separate files, however. This required the use of wrapper files, which is ugly (see the wrapper-file sketch after these notes). We see fewer restrictions for SYCL.

  • [Pascuzzi] Sl. 24 RE: strategies and metrics

    • Plan to have some level of integration into existing experiment frameworks?

    • Mostly up to the experiments, though we will provide guidance based on our experience.

  • [Stephen] Idea for future work: investigating the performance of all these solutions using a formal performance-portability metric (à la S. J. Pennycook et al.; the metric is sketched after these notes).

  • [J. Apostolakis] Comment on the ATLFast slide (maybe nitpicking): that the detailed Geant4 simulation of the LAr calorimeters takes a ‘long’ time (per event) is due to two key factors: the larger one is the massive number of steps (mostly from lower-energy tracks) that must be simulated, and the smaller one is the complexity of the geometry.

  • Regarding binary portability:

    • [Jeff Hammond] Why not just split the app driver and backends into shared libraries so you can dlopen them? (A minimal sketch of this idea appears after these notes.)

    • [Seth Johnson]: @jeff that might be complicated from a build-system perspective. I also think it's worth noting that even with CUDA alone, you need to either compile massive amounts of cubin code or several different PTX versions and rely on slow JIT compilation, depending on the GPU versions you're deploying to. And some codes also "assume" you're only using one CUDA arch target, which makes the toolchain infrastructure even trickier.
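As a concrete illustration of the single-source portability discussed in the alpaka items above, here is a minimal kernel sketch, assuming a recent alpaka release (0.6 or later); the kernel and variable names are illustrative only, not taken from the talk. The same source compiles for a CPU, CUDA, HIP or (once the backend mentioned above is merged) SYCL accelerator selected at build time.

    // axpy_kernel.h -- illustrative sketch only: the same kernel source is
    // compiled for whichever alpaka accelerator is chosen at build time.
    #include <alpaka/alpaka.hpp>
    #include <cstddef>

    struct AxpyKernel
    {
        template <typename TAcc>
        ALPAKA_FN_ACC void operator()(TAcc const& acc, float a,
                                      float const* x, float* y,
                                      std::size_t n) const
        {
            // global thread index on whichever backend this was compiled for
            auto const i = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0u];
            if (i < n)
            {
                y[i] = a * x[i] + y[i];
            }
        }
    };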
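On the single-compilation-unit point raised by C. Leggett above, a minimal sketch of the "wrapper file" pattern, written here with Kokkos; the file and function names are hypothetical. The device-callable helper and the kernel that uses it are gathered into one translation unit so that all device code is visible to a single compiler invocation.

    // kernels_wrapper.cpp -- hypothetical "wrapper" translation unit:
    // device-callable helpers and the kernels that call them must be
    // visible together, so they are collected into one file.
    #include <Kokkos_Core.hpp>

    // device-callable helper (in real code this might be #included
    // from a header rather than defined here)
    KOKKOS_INLINE_FUNCTION float weight(float x) { return x * x; }

    // kernel calling the helper: the helper's definition must be visible
    // in this same compilation unit
    void applyWeights(Kokkos::View<const float*> in,
                      Kokkos::View<float*> out)
    {
        Kokkos::parallel_for(
            "applyWeights", in.extent(0),
            KOKKOS_LAMBDA(const int i) { out(i) = weight(in(i)); });
    }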
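The performance-portability metric mentioned by Stephen (Pennycook, Sewall and Lee) is the harmonic mean of an application's efficiencies over a chosen set of platforms, and zero if the application fails to run on any of them; the efficiency e_i can be taken as architectural efficiency (fraction of peak) or application efficiency (relative to the best-known implementation):

    % Performance portability of application a solving problem p on the
    % platform set H; e_i(a, p) is the efficiency achieved on platform i.
    \mathrm{PP}(a, p, H) =
      \begin{cases}
        \dfrac{|H|}{\sum_{i \in H} \dfrac{1}{e_i(a, p)}}
          & \text{if } a \text{ runs on every platform } i \in H \\[1ex]
        0 & \text{otherwise}
      \end{cases}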
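Finally, a minimal sketch of Jeff Hammond's dlopen suggestion, assuming each backend (CUDA, HIP, SYCL, ...) is built as its own shared library exporting a common C entry point; the library name libbackend_cuda.so and the symbol backend_run are hypothetical. On Linux, link the driver with -ldl.

    // backend_loader.cpp -- sketch only: the driver picks a backend
    // shared library at run time instead of linking it in at build time.
    #include <dlfcn.h>
    #include <cstdio>

    using run_fn = int (*)(int, char**);

    int main(int argc, char** argv)
    {
        // pick a backend library, e.g. from the command line
        char const* lib = (argc > 1) ? argv[1] : "./libbackend_cuda.so";

        void* handle = dlopen(lib, RTLD_NOW | RTLD_LOCAL);
        if (!handle) { std::fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

        // look up the common entry point exported by every backend
        auto run = reinterpret_cast<run_fn>(dlsym(handle, "backend_run"));
        if (!run) { std::fprintf(stderr, "dlsym: %s\n", dlerror()); return 1; }

        int const rc = run(argc, argv);
        dlclose(handle);
        return rc;
    }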

    • 16:30 – 16:35
      News 5m
      Speakers: Benjamin Morgan (University of Warwick (GB)), Graeme A Stewart (CERN), Dr Maria Girone (CERN), Michael Bussmann (Helmholtz-Zentrum Dresden - Rossendorf), Stefan Roiser (CERN)
    • 16:35 – 17:05
      Fine-Grained I/O and Storage on HPC Platforms 30m
      Speakers: Peter Van Gemmeren (Argonne National Laboratory (US)), Saba Sehrish (Fermilab)
    • 17:05 – 17:35
      Portable Parallelization Strategies 30m
      Speaker: Charles Leggett (Lawrence Berkeley National Lab (US))