Compute Accelerator Forum - Kokkos

Europe/Zurich
Virtual (Zoom)

Description

To receive announcements and information about this forum, please subscribe to compute-accelerator-forum-announce@cern.ch

Videoconference: Compute Accelerator Forum
Zoom Meeting ID: 69560339820
Host: Graeme A Stewart
Alternative hosts: Thomas Nik Bazl Fard, Benjamin Morgan, Maria Girone, Stefan Roiser

Kokkos, the C++ performance portability programming model

  • One of the big open questions (for SYCL as well) is how far we can write universal algorithms. In HEP code we find that it is difficult to make our algorithms work well on every kind of hardware [Attila]

    • The use cases Kokkos has looked at generally still work well on CPU after migration to the “universal/accelerator” version. About 60% of the algorithms looked at are trivially portable [Christian]

    • One issue in HEP is that we don’t need to do the same FP computation every time, e.g. there are lots of if statements [Attila]

    • … then there are 25-30% of algorithms that can be parallelized but require adaptation for different architectures, e.g. the number of rows in the example. “Under the hood” these can use different strategies/approaches, e.g. data vs atomics. Sometimes it’s better to write two separate specialized algorithms than to deal with the complexity of a single universal one. These percentages are from O(200) codes using Kokkos [Christian]

    • [Jeff Hammond] “No one from Intel _who writes code_ ever claimed single-source performance portability for everything ;-)”
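The “data vs atomics” remark in the thread above can be illustrated with a minimal sketch. It assumes that “data” means replicating the output per worker and merging afterwards, which the minutes do not spell out. It uses plain C++ threads rather than the Kokkos API, and the function names `histogram_atomic` and `histogram_replicated` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Strategy 1 ("atomics"): every worker increments one shared histogram,
// using atomic updates to avoid lost counts.
std::vector<int> histogram_atomic(const std::vector<std::size_t>& bin_of_item,
                                  std::size_t n_bins, unsigned n_workers) {
    std::vector<std::atomic<int>> shared(n_bins);
    for (auto& c : shared) c.store(0);
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < n_workers; ++w) {
        workers.emplace_back([&, w] {
            // Worker w handles items w, w + n_workers, w + 2 * n_workers, ...
            for (std::size_t i = w; i < bin_of_item.size(); i += n_workers)
                shared[bin_of_item[i]].fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& t : workers) t.join();
    std::vector<int> result(n_bins);
    for (std::size_t b = 0; b < n_bins; ++b) result[b] = shared[b].load();
    return result;
}

// Strategy 2 ("data", assumed to mean replication): every worker fills a
// private histogram without synchronization; the copies are merged at the end.
std::vector<int> histogram_replicated(const std::vector<std::size_t>& bin_of_item,
                                      std::size_t n_bins, unsigned n_workers) {
    std::vector<std::vector<int>> priv(n_workers, std::vector<int>(n_bins, 0));
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < n_workers; ++w) {
        workers.emplace_back([&, w] {
            for (std::size_t i = w; i < bin_of_item.size(); i += n_workers)
                ++priv[w][bin_of_item[i]];
        });
    }
    for (auto& t : workers) t.join();
    std::vector<int> result(n_bins, 0);
    for (const auto& p : priv)
        for (std::size_t b = 0; b < n_bins; ++b) result[b] += p[b];
    return result;
}
```

Both functions return the same counts. Which one is faster depends on the hardware and on how often workers hit the same bin, which is the kind of architecture-specific choice the reply refers to.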

  • Given that there is a SYCL backend, I assume Kokkos is higher level and has more features than SYCL itself [Vincent]?

    • One (flip!) aspect at the moment is that Kokkos is more stable than many compilers. The less flip answer is that SYCL doesn’t have complete vendor buy-in just yet, e.g. the question of Intel/Codeplay exposing all features of NVIDIA GPUs. Kokkos, through Sandia, does have good links with the vendors, e.g. Kokkos is a test case for their compilers/tools [Christian]

    • Follow-up: are there measurements of performance portability vs native implementations [Vincent]?

    • We have some measurements from mini-apps, which get very close on GPUs. Lambdas are very important given their prevalence in industry. Use of SIMD helps to recover some of the difference [Christian]

  • When your lambdas get bigger, you will be limited by register spilling, and then the layout of your memory matters a lot (one more indirection and you get a factor-of-two performance drop). Do you optimize for this [Riccardo]?

    • The largest kernel we’ve seen is ~10000 lines after optimization. Going via registers or constant cache is important. Looking at register pressure helps. Work hard to avoid pointer chasing. Prefer CSR data structures [Christian]
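As an illustration of why CSR avoids pointer chasing, here is a minimal sketch of a sparse matrix-vector product. It assumes “CSR” means the usual compressed sparse row layout. The `CsrMatrix` type and `csr_multiply` function are hypothetical names in plain C++, not Kokkos types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compressed sparse row storage: three flat arrays instead of per-row
// heap allocations, so a row is reached by indexing, not by following pointers.
struct CsrMatrix {
    std::size_t n_rows;
    std::vector<std::size_t> row_offsets;  // size n_rows + 1; row r owns [row_offsets[r], row_offsets[r + 1])
    std::vector<std::size_t> col_indices;  // column of each stored entry
    std::vector<double> values;            // value of each stored entry
};

// y = A * x; each row is independent, so the outer loop is the natural
// candidate for a parallel loop in a portability layer.
std::vector<double> csr_multiply(const CsrMatrix& a, const std::vector<double>& x) {
    std::vector<double> y(a.n_rows, 0.0);
    for (std::size_t row = 0; row < a.n_rows; ++row) {
        double sum = 0.0;
        for (std::size_t k = a.row_offsets[row]; k < a.row_offsets[row + 1]; ++k)
            sum += a.values[k] * x[a.col_indices[k]];
        y[row] = sum;
    }
    return y;
}
```

The only indirection left is the gather `x[a.col_indices[k]]`; the matrix data itself is read contiguously.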

  • Do you support rich data types such as structs, sum types, or native structures of arrays [Riccardo]?

    • Yes, and any C++ data structure can be supported [Christian]
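For readers unfamiliar with the terms in the question, the following sketch contrasts an array of structs with a structure of arrays in plain C++. The `Hit` type and its fields are hypothetical, and nothing here is taken from the Kokkos API; the minutes only record that any C++ data structure can be supported.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structs element: all fields of one hit sit next to each other.
struct Hit {
    double x;
    double y;
    double energy;
};

// Structure of arrays: one contiguous array per field, so a loop that only
// needs `energy` reads a dense stream of doubles.
struct HitsSoA {
    std::vector<double> x;
    std::vector<double> y;
    std::vector<double> energy;
    std::size_t size() const { return energy.size(); }
};

// Convert from the AoS layout to the SoA layout.
HitsSoA to_soa(const std::vector<Hit>& hits) {
    HitsSoA soa;
    for (const Hit& h : hits) {
        soa.x.push_back(h.x);
        soa.y.push_back(h.y);
        soa.energy.push_back(h.energy);
    }
    return soa;
}

// Same computation on both layouts.
double total_energy(const std::vector<Hit>& hits) {
    double sum = 0.0;
    for (const Hit& h : hits) sum += h.energy;
    return sum;
}

double total_energy(const HitsSoA& hits) {
    double sum = 0.0;
    for (double e : hits.energy) sum += e;
    return sum;
}
```

The two overloads give the same result; the layouts differ only in how memory is traversed, which tends to matter more on GPUs and for SIMD on CPUs.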

  • Could you, at least in principle, write a backend based on Level Zero or OpenCL [Riccardo]?

    • We must have a C++ compiler at least. OpenCL isn’t really supported anymore by all vendors (Intel might not have OpenCL in SYCL). We don’t see a use case for Level Zero yet, but are looking at a SYCL backend for FPGAs [Christian]

    • Backend development is effectively writing a compiler [Attila]. Yes! [Christian]

  • You mentioned atomics a lot. We experienced a massive performance drop when we emulated atomics where they are not available. Are your reductions really performance portable [Riccardo]?

    • Yes, on CPUs atomics have issues; they are better on GPUs [Christian]
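To make the concern about emulated atomics concrete, here is a minimal sketch in plain C++ (not Kokkos). It compares a sum that goes through an atomic add emulated with a compare-and-swap loop against a reduction with one private partial sum per worker. The function names are hypothetical, and the CAS loop is only one possible emulation; the minutes do not say which one was used.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Emulated atomic add on a double: retry a compare-and-swap until no other
// thread has modified the target in between. Under contention the retries
// are where the performance goes.
void atomic_add_emulated(std::atomic<double>& target, double value) {
    double expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_relaxed)) {
        // On failure `expected` now holds the current value; try again.
    }
}

// Every element is added to one shared accumulator via the emulated atomic.
double sum_with_atomics(const std::vector<double>& data, unsigned n_workers) {
    std::atomic<double> total(0.0);
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < n_workers; ++w) {
        workers.emplace_back([&, w] {
            for (std::size_t i = w; i < data.size(); i += n_workers)
                atomic_add_emulated(total, data[i]);
        });
    }
    for (auto& t : workers) t.join();
    return total.load();
}

// Each worker accumulates privately and writes one partial result;
// the partials are combined serially at the end.
double sum_with_partials(const std::vector<double>& data, unsigned n_workers) {
    std::vector<double> partial(n_workers, 0.0);
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < n_workers; ++w) {
        workers.emplace_back([&, w] {
            double local = 0.0;
            for (std::size_t i = w; i < data.size(); i += n_workers)
                local += data[i];
            partial[w] = local;
        });
    }
    for (auto& t : workers) t.join();
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

Both variants compute the same sum for exactly representable inputs; in general the floating-point result can depend on the order of additions, which differs between the two.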

  • In the performance portability space, C++ seems to be moving towards higher-level abstractions. What could this mean for the future use of C++ here [Stephen]?

    • C++ has a lot of advantages for writing models/abstractions in a performant way without having to write a new compiler. Given the ecosystem, not just in HPC, and the vendor involvement/investment, it will be around for a while. But other languages are interesting, such as Julia. Vendors could come along here, but would be specialized for their use cases, which might overlap with ours. A counterexample is the development of the Fortran flang compiler [Christian]

Timetable

  • 4:30 PM - 4:35 PM: News (5m)
    Speakers: Benjamin Morgan (University of Warwick (GB)), Graeme A Stewart (CERN), Dr Maria Girone (CERN), Michael Bussmann (Helmholtz-Zentrum Dresden - Rossendorf), Stefan Roiser (CERN)
  • 4:35 PM - 5:35 PM: Kokkos, the C++ performance portability programming model (1h)
    Speaker: Christian Trott (Sandia Nat. Lab.)