Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY

Timezone: Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 10:00 AM 10:20 AM
      Discussion 20m
      Speakers: David Rohr (CERN), Giulio Eulisse (CERN)
    • 10:20 AM 10:25 AM
      Following up JIRA tickets 5m
      Speaker: Ernst Hellbar (CERN)
    • 10:25 AM 10:30 AM
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
    • 10:30 AM 10:35 AM
      ITS Tracking 5m
      Speaker: Matteo Concas (CERN)
    • 10:35 AM 10:45 AM
      TPC Track Model Decoding on GPU 10m
      Speaker: Gabriele Cimador (Universita e INFN Trieste (IT))

      Global Parameter Optimisation

      Context:

      Tried manual tuning of GMMergerTrackFit. This kernel is called twice:

      1. First with
        1. block size: 128
        2. grid size s.t. grid size*block size >= #tracks
      2. Second with
        1. block size: 128
        2. grid size: 120
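
      A minimal sketch of the two launch configurations listed above. It assumes the first grid size is the integer ceiling of #tracks / block size. The helper names and the example track count are hypothetical and do not come from the O2 code.

```python
def dynamic_grid_size(n_tracks: int, block_size: int = 128) -> int:
    # Integer ceiling division: smallest grid size g with g * block_size >= n_tracks
    return (n_tracks + block_size - 1) // block_size


def track_fit_launches(n_tracks: int, block_size: int = 128, fixed_grid: int = 120):
    # Hypothetical helper: (block size, grid size) of the two GMMergerTrackFit calls
    return [
        (block_size, dynamic_grid_size(n_tracks, block_size)),  # first call: scales with #tracks
        (block_size, fixed_grid),  # second call: fixed grid size of 120
    ]


print(track_fit_launches(1000))  # [(128, 8), (128, 120)]
```

      With a fixed grid of 120 blocks, one pass covers only 120 * 128 = 15360 threads. The second call therefore presumably loops over tracks inside the kernel (for example with a grid-stride loop), but this is an assumption.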

      The two merger calls are located in the GPUChain as follows (sync chain, shown in the image in the original slides):

      Tuning approach:

      Used the same configuration for both kernel calls (instead of two separate configurations). Kept 128 threads per block and increased the grid size: 120 * {1,2,3,4,5,6,7} blocks.
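
      The sweep above can be written out as a list of (block size, grid size) pairs. This sketch assumes the 120 Compute Units of the MI100, and the helper name is hypothetical. It also prints the total number of threads per launch for reference.

```python
CU_COUNT = 120    # Compute Units on MI100
BLOCK_SIZE = 128  # threads per block, kept fixed during the sweep


def sweep_configs(block_size=BLOCK_SIZE, cus=CU_COUNT, multipliers=range(1, 8)):
    # The same (block, grid) pair is applied to both merger calls
    return [(block_size, cus * k) for k in multipliers]


for block, grid in sweep_configs():
    print(f"block={block} grid={grid} threads/launch={block * grid}")
```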

      Results:

      Tested on MI100.

      Keep in mind: in the following plots, "Normal" for merger 1 means the dynamic grid size such that grid size * block size >= #tracks. In practice:

      • grid size = 492 for pp 100kHz
      • grid size = 10907 for pp 2MHz
      • grid size = 1795 for PbPb 5kHz
      • grid size = 19709 for PbPb 50kHz
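
      Suppose "Normal" is exactly the ceiling division by 128 threads per block. Then each reported grid size pins the track count to a window of 128 values, and it can be compared with the 120 CUs of the MI100. A sketch under that assumption (helper name hypothetical):

```python
BLOCK_SIZE = 128
CU_COUNT = 120  # MI100

# "Normal" grid sizes for merger 1, as reported above
NORMAL_GRID = {
    "pp 100kHz": 492,
    "pp 2MHz": 10907,
    "PbPb 5kHz": 1795,
    "PbPb 50kHz": 19709,
}


def track_range(grid_size, block_size=BLOCK_SIZE):
    # Inverse of ceil(n / block_size) == grid_size, i.e. (g - 1) * b < n <= g * b
    return (grid_size - 1) * block_size + 1, grid_size * block_size


for dataset, grid in NORMAL_GRID.items():
    lo, hi = track_range(grid)
    print(f"{dataset}: {lo}..{hi} tracks, {grid / CU_COUNT:.1f} blocks per CU")
```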

      pp, sync

      • The first merger benefits from larger grid sizes, but it already seems to match the performance of the normal configuration at 840 blocks, so there is no need to scale the grid size up to ~10 thousand blocks
      • The second merger benefits from larger grid sizes than the normal one (120 blocks)

      pp, async

      Results for async mergers 1 and 2 are roughly the same as for sync.

      PbPb, sync

      • For low IR, merger 1 seems to benefit from smaller grid sizes (Normal for 5kHz is 1795); for high IR it is difficult to match the normal configuration (480 seems promising for both)
      • Merger 2 also benefits from larger grid sizes for both IRs

      PbPb, async

      Same observations for the asynchronous reconstruction as for the synchronous one.

      Grid search

      Attempted a grid search on MI100. The parameter search space is defined by block_size = {32, 64, 128} and grid_size = {120, 240, 360, 480, 600, 840}. Block sizes are normally multiples of the wavefront (warp) size (64); 32 was also included experimentally, to see what happens with a non-optimal block size. Grid sizes are multiples of the number of Compute Units of the MI100 (120 CUs).

      Thus the parameter search space is {32, 64, 128} x {120, 240, 360, 480, 600, 840}.
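
      The search space is the Cartesian product of the two sets. A minimal sketch of the enumeration follows. Here `measure` is a hypothetical callable that returns the mean execution time of one configuration (for instance, a wrapper around the benchmark runs). It is not an existing O2 function.

```python
from itertools import product

WAVEFRONT_SIZE = 64  # MI100 wavefront (warp) size
CU_COUNT = 120       # MI100 Compute Units

BLOCK_SIZES = (32, 64, 128)
GRID_SIZES = (120, 240, 360, 480, 600, 840)

# 3 x 6 = 18 (block size, grid size) configurations
SEARCH_SPACE = list(product(BLOCK_SIZES, GRID_SIZES))


def grid_search(measure, space=SEARCH_SPACE):
    # measure(block_size, grid_size) -> mean execution time (hypothetical callable)
    return {(block, grid): measure(block, grid) for block, grid in space}


# 32 is the deliberately non-multiple block size included as an experiment
assert [b for b in BLOCK_SIZES if b % WAVEFRONT_SIZE != 0] == [32]
assert all(g % CU_COUNT == 0 for g in GRID_SIZES)
```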

      Results are shown as heatmaps. Each mean execution time is normalised to the mean execution time obtained with the current default parameters. Hence:

      • cell < 1 (red cell): better than the current configuration
      • cell = 1 (white cell): equal to the current configuration
      • cell > 1 (blue cell): worse than the current configuration
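
      The normalisation and colour convention can be sketched as follows. The optional `tol` band for measurement noise is an assumption added for illustration and is not part of the original analysis. With the default `tol=0.0`, the mapping is exactly the one listed above.

```python
def normalise(mean_times, baseline_time):
    # Divide each configuration's mean time by the mean time of the current default
    return {cfg: t / baseline_time for cfg, t in mean_times.items()}


def classify(ratio, tol=0.0):
    # Heatmap convention: < 1 better (red), = 1 equal (white), > 1 worse (blue)
    if ratio < 1.0 - tol:
        return "better (red)"
    if ratio > 1.0 + tol:
        return "worse (blue)"
    return "equal (white)"
```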

      pp

      For merger 1, for both low and high IRs and for both sync and async, the fixed {128, 840} configuration reaches the same performance as the dynamic configuration. The dynamic configuration results in {128, 492} for 100kHz and {128, 10907} for 2MHz (based on #tracks).

      For merger 2, low IR seems to prefer smaller configurations, while bigger configurations work better for high IR. In either case there is room for improvement.

      PbPb

      For merger 1, the {128, 840} configuration runs faster than {128, 1795} at low IR, while at high IR the performance is equal (w.r.t. {128, 19709}).

      For merger 2, several configurations perform better than the current one.

      To-do:

      Based on these observations:

      • Take measurements on MI50 as well
      • Try even larger grid sizes
      • Measure other kernels
        • Understand how to properly time kernels without serializing them
        • Investigate the SliceTracker part (concurrent kernels)
    • 10:45 AM 10:55 AM
      Efficient Data Structures 10m
      Speaker: Dr Oliver Gregor Rietmann (CERN)