Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY
10:00 AM → 10:20 AM  Discussion (20m). Speakers: David Rohr (CERN), Giulio Eulisse (CERN)
10:20 AM → 10:25 AM  Following up JIRA tickets (5m). Speaker: Ernst Hellbar (CERN)
10:25 AM → 10:30 AM  TPC ML Clustering (5m). Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
- Testing secondary vertexing: no significant difference found between the two algorithms.
- Simulated: 10 events of 50 kHz PbPb + 100 K0S (boxgen) + 100 Lambda (boxgen)
- Secondary vertexing found 2888 V0s (GPU CF) vs. 2889 (NN)
- Mass histograms look essentially identical. Left: NN, right: GPU CF; x-axis = reconstructed invariant mass under the Lambda hypothesis, in GeV (nominal mass = 1.1157 GeV). See the invariant-mass definition after this list.
- No difference visible between the K0S histograms either, and a mass peak is arguably found (nominal mass = 0.4976 GeV)
- The Lambda mass peak is not visible for either algorithm
- More statistics are probably needed
- Currently working on matching efficiency: the workflow runs, but it still needs to be checked whether the NN was actually applied correctly
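For reference, the invariant-mass definition behind these histograms is the standard V0 formula (general physics background, not specific to either algorithm): under the Lambda -> p pi- hypothesis (analogously pi+ pi- for K0S),

\[ M = \sqrt{(E_1 + E_2)^2 - \lVert \vec{p}_1 + \vec{p}_2 \rVert^2}, \qquad E_i = \sqrt{\lVert \vec{p}_i \rVert^2 + m_i^2}, \]

where \(\vec{p}_i\) are the reconstructed daughter-track momenta and \(m_i\) the assigned mass hypotheses.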
10:30 AM → 10:35 AM  ITS Tracking (5m). Speaker: Matteo Concas (CERN)
10:35 AM → 10:45 AM  TPC Track Model Decoding on GPU (10m). Speaker: Gabriele Cimador (Universita e INFN Trieste (IT))
Global Parameter Optimisation
Input dataset simulation
Simulated several timeframes:
- pp: 100 kHz, 200 kHz, 500 kHz, 1 MHz, 2 MHz
- PbPb: 10 kHz, 15 kHz, 20 kHz, 27 kHz, 35 kHz, 42 kHz, 47 kHz, 50 kHz
Each configuration was simulated twice: once with 32-orbit timeframes and once with 128-orbit timeframes.
For now, only one simulation per configuration (beam type, interaction rate, timeframe length).
GPU Parameters study
Focusing on grid and block size. Analysed the GPU workflow of the sync/async TPC processing. The image below shows the workflow of two HIP streams of the synchronous TPC processing:
By looking at the trace file:
- Clusterizer chain:
- small concurrent kernels
- overlap during execution
- overall taking considerable time
- --> dependent parameters, global optimisation
- SliceTracker chain:
- medium concurrent kernels
- all streams used
- main kernel is TrackletConstructor
- the trace file shows CreateSliceData taking a lot of time, but --debug does not confirm this; still investigating
- the trace file contains a "Marker" entry that is not present in NVIDIA trace files; still investigating
- --> dependent parameters, global optimisation
- Merger chain:
- MergeBorders_step2: lots of small concurrent kernels, concurrent with a limited set of other one-stream kernels --> dependent parameters, global optimisation (within the set)
- SliceRefit: lots of small one-stream kernels --> independent parameters, local optimisation
- MergerTrackFit: one long one-stream kernel --> independent parameters, local optimisation (maybe limited, since the values also depend on the number of tracks)
- MergerFollowLoopers: one medium one-stream kernel --> independent parameters, local optimisation
- Compression/Decompression chain:
- one-stream kernels --> independent parameters, local optimisation
- multiple-stream kernels, not overlapping --> independent parameters, local optimisation
Optimisation strategy
- For the moment just manual trial-and-error, using observations from the trace output
- Started from MergerTrackFit, because:
- Long kernel
- One stream
- Not concurrent with any other kernels
- Caveat: grid size depends on the number of tracks
- Changing values in GPUDefGPUParameters.h takes a very long time to compile, even with the standalone benchmark
- Currently forcing a custom krnlExec object in kernel calls, e.g.:
runKernel<KernelClass, KernelClass::step>({{n_blocks, n_threads, stream}});
- Less convenient, but much faster
- Created script that automatically fetches grid and block size for all the kernels, useful for runtime grid/block numbers like GetGrid(Merger.NOutputTracks(), 0, deviceType)
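To illustrate this kind of call-site sweep, a minimal standalone HIP sketch (a hypothetical illustration, not the O2 runKernel machinery): it times a dummy kernel, standing in for a one-stream kernel like MergerTrackFit, over several launch configurations using HIP events, so no parameter header has to be recompiled between trials.

// Minimal HIP sketch (hypothetical, not O2 code): time a dummy stand-in kernel
// over several launch configurations without recompiling parameter headers.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void dummyFit(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float x = data[i];
    for (int k = 0; k < 200; ++k) x = x * 1.000001f + 0.5f; // fake fit work
    data[i] = x;
  }
}

int main() {
  const int n = 1 << 20;
  float* d = nullptr;
  hipMalloc((void**)&d, n * sizeof(float));
  hipEvent_t start, stop;
  hipEventCreate(&start);
  hipEventCreate(&stop);
  const int threadOptions[] = {64, 128, 256, 512};
  for (int threads : threadOptions) {
    int blocks = (n + threads - 1) / threads; // cover all n elements
    hipEventRecord(start, 0);
    hipLaunchKernelGGL(dummyFit, dim3(blocks), dim3(threads), 0, 0, d, n);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);
    float ms = 0.f;
    hipEventElapsedTime(&ms, start, stop);
    printf("blocks=%d threads=%d -> %.3f ms\n", blocks, threads, ms);
  }
  hipEventDestroy(start);
  hipEventDestroy(stop);
  hipFree(d);
  return 0;
}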
Possible bug spotted
HIP_AMDGPUTARGET set to "default" in GPU/GPUTracking/Standalone/cmake/config.cmake expands to HIP_AMDGPUTARGET=gfx906;gfx908, which forces the MI50 parameters.
With HIP_AMDGPUTARGET=gfx906;gfx908, the first if clause (for MI50) is entered even when compiling for MI100. Workaround: commented out set(HIP_AMDGPUTARGET "default") in the config.cmake of the standalone benchmark and forced usage of the MI100 parameters via:
cmake -DCMAKE_INSTALL_PREFIX=../ -DHIP_AMDGPUTARGET="gfx908" ~/alice/O2/GPU/GPUTracking/Standalone/
This was not investigated further.
Possible ideas for post-manual optimisation
- Isolate the parameters which are dependent, i.e. kernels from the same task that run in parallel (e.g. the Clusterizer chain)
- Apply known optimisation techniques to such kernel groups
- Grid/random search (a minimal sketch follows below)
- Bayesian optimization?
See: F.-J. Willemsen, R. Van Nieuwpoort, and B. Van Werkhoven, “Bayesian Optimization for auto-tuning GPU kernels”, International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) at Supercomputing (SC21), 2021. Available: https://arxiv.org/abs/2111.14991
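As a minimal sketch of the grid-search idea over a dependent kernel group (hypothetical: measure() here is a stand-in cost model, where a real version would launch the coupled kernels together and return their combined time):

// Hypothetical grid-search sketch over coupled launch parameters of two
// concurrent kernels; dependent parameters must be swept jointly.
#include <cstdio>
#include <cmath>

struct Config { int threadsA; int threadsB; };

// Stand-in for "launch both kernels with these parameters and time them".
double measure(const Config& c) {
  return std::fabs(c.threadsA - 256.0) * 0.01 + std::fabs(c.threadsB - 128.0) * 0.02;
}

int main() {
  const int options[] = {64, 128, 256, 512};
  Config best{0, 0};
  double bestTime = 1e30;
  for (int a : options) {
    for (int b : options) { // Cartesian product: the parameters are coupled
      Config c{a, b};
      double t = measure(c);
      if (t < bestTime) { bestTime = t; best = c; }
    }
  }
  printf("best: threadsA=%d threadsB=%d (cost %.2f)\n", best.threadsA, best.threadsB, bestTime);
  return 0;
}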
10:45 AM → 10:55 AM  Efficient Data Structures (10m). Speaker: Dr Oliver Gregor Rietmann (CERN)