Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY
- 10:00 AM → 10:20 AM: Discussion (20m). Speakers: David Rohr (CERN), Giulio Eulisse (CERN)
- 10:20 AM → 10:25 AM: Following up JIRA tickets (5m). Speaker: Ernst Hellbar (CERN)
- 10:25 AM → 10:30 AM: TPC ML Clustering (5m). Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
- 10:30 AM → 10:35 AM: ITS Tracking (5m). Speaker: Matteo Concas (CERN)
- 10:35 AM → 10:45 AM: TPC Track Model Decoding on GPU (10m). Speaker: Gabriele Cimador (Universita e INFN Trieste (IT))
Global Parameter Optimisation
Context:
Tried manual tuning of GMMergerTrackFit. This kernel is called twice:
- First call: block size 128, grid size chosen s.t. grid size * block size >= #tracks
- Second call: block size 128, grid size 120
The two mergers are located in the GPUChain (sync chain shown in the slides):
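The dynamic grid size of the first call can be sketched as a ceiling division (a minimal illustration; the track count used below is hypothetical, and the real launch logic lives in the O2 GPU code):

```python
def grid_size(n_tracks: int, block_size: int = 128) -> int:
    # Smallest grid size s.t. grid_size * block_size >= n_tracks
    return (n_tracks + block_size - 1) // block_size

# Hypothetical track count: 1000 tracks need 8 blocks of 128 threads
print(grid_size(1000))  # 8
```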
Tuning approach:
Used the same configuration for both kernels (instead of two separate configurations). Kept 128 threads per block and increased the grid size: 120 * {1,2,3,4,5,6,7}.
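The scanned grid sizes above can be enumerated as (a sketch; 120 is the second call's fixed grid size from the current configuration):

```python
base = 120  # grid size of the second kernel call in the current configuration
candidate_grids = [base * k for k in range(1, 8)]  # 120 * {1..7}
print(candidate_grids)  # [120, 240, 360, 480, 600, 720, 840]
```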
Results:
Tested on MI100.
Keep in mind: in the following plots, "Normal" for Merger 1 means grid size s.t. grid size * block size >= #tracks. In practice:
- grid size = 492 for pp 100kHz
- grid size = 10907 for pp 2MHz
- grid size = 1795 for PbPb 5kHz
- grid size = 19709 for PbPb 50kHz
pp, sync
- First merger benefits from large grid sizes, but it seems to match the normal configuration already at 840 blocks; no need to scale the grid size up to ~10,000
- Second merger benefits from grid sizes larger than normal (120 blocks)
pp, async
Mergers 1 and 2 show more or less the same results as in sync.
PbPb, sync
- For low IR, merger 1 seems to benefit from lower grid sizes (Normal for 5kHz is 1795); for high IR it is difficult to match the normal configuration (480 seems promising for both)
- Merger 2 also benefits from bigger grid sizes for both IRs
PbPb, async
Same observations for the asynchronous reco as for the sync one.
Grid search
Attempted a grid search approach on MI100. The parameter search span is defined as block_size = {32, 64, 128} and grid_size = {120, 240, 360, 480, 600, 840}. Block sizes are multiples of the warp size (64), with 32 added experimentally to see what happens with a non-optimal block size. Grid sizes are multiples of the number of Compute Units of the MI100 (120 CUs). Thus the parameter search space is {32, 64, 128} x {120, 240, 360, 480, 600, 840}.
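The 3 x 6 search space above can be enumerated directly (a sketch of the parameter grid, not the benchmarking harness itself):

```python
from itertools import product

block_sizes = (32, 64, 128)                    # 32 added experimentally
grid_sizes = (120, 240, 360, 480, 600, 840)    # multiples of the MI100's 120 CUs
search_space = list(product(block_sizes, grid_sizes))
print(len(search_space))  # 18
```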
Heatmaps are plotted. Every mean execution time is normalised to the mean execution time with the current standard parameters. Hence:
- cell < 1 (red cell): better configuration than the current one
- cell = 1 (white cell): equal to the current configuration
- cell > 1 (blue cell): worse configuration than the current one
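The normalisation behind each heatmap cell can be sketched as follows (the timings below are hypothetical; only the baseline configuration (128, 492) is taken from the pp 100kHz numbers above):

```python
def normalise(mean_times: dict, baseline_key) -> dict:
    # Divide each mean execution time by the baseline (current standard config);
    # cells < 1 are faster than the baseline, cells > 1 slower.
    base = mean_times[baseline_key]
    return {cfg: t / base for cfg, t in mean_times.items()}

# Hypothetical mean times in ms, keyed by (block_size, grid_size)
times = {(128, 492): 2.0, (128, 840): 1.6, (32, 120): 3.0}
norm = normalise(times, (128, 492))
print(norm[(128, 840)])  # 0.8 -> red cell, faster than the current config
```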
pp
For merger 1, both for low and high IRs and for sync and async, the same performance is reached with the {128, 840} configuration, instead of the dynamic configuration which results in {128, 492} for 100kHz and {128, 10907} for 2MHz (based on #tracks). For merger 2, low IR seems to prefer smaller configurations, while for high IR bigger configurations work better. In any case there is room for improvement.
PbPb
For Merger 1, configuration {128, 840} runs faster for low IR than {128, 1795}, while for high IR the performance is equal (w.r.t. {128, 19709}). Merger 2 can be leveraged better with several configurations.
To-do:
Based on these observations:
- Take measurements also on MI50
- Try even higher grid sizes
- Measure other kernels
- Understand how to properly time kernels without serializing them
- Investigate the SliceTracker part (concurrent kernels)
- 10:45 AM → 10:55 AM: Efficient Data Structures (10m). Speaker: Dr Oliver Gregor Rietmann (CERN)