Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY

Name: Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY
Start: 2025-01-15T10:00:00+01:00
End: 2025-01-15T11:30:00+01:00
Location: No location set

Wednesday 15 Jan 2025, 10:00 → 11:30 Europe/Zurich

61230224927

David Rohr

Join via phone

- 10:00 → 10:20
  
  Discussion 20m
  
  Speakers: David Rohr (CERN), Giulio Eulisse (CERN)
- 10:20 → 10:25
  
  Following up JIRA tickets 5m
  
  Speaker: Ernst Hellbar (CERN)
- 10:25 → 10:30
  TPC ML Clustering 5m
  
  Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
  Testing secondary vertexing. No significant difference found between both algorithms.
  
  Simulated: 10 Ev 50kHz PbPb + 100 K0S (boxgen) + 100 Lambda (boxgen)
  
  Secondary vertexing found 2888 V0s (GPU CF) / 2889 (NN)
  
  Mass histograms look basically identical. Left: NN, Right: GPU CF, x = reconstructed invariant mass calculated as Lambda in GeV (exact mass = 1.1157 GeV)
  
  No difference visible for K0 histograms, but mass peak is found (I would say). Exact mass = 0.4976 GeV
  
  Lambda mass peak is not visible for either algorithm
  
  More statistics necessary probably
  
  Currently working on matching efficiency -> Got workflow working, but need to check if NN was actually correctly applied
- 10:30 → 10:35
  
  ITS Tracking 5m
  
  Speaker: Matteo Concas (CERN)
- 10:35 → 10:45
  TPC Track Model Decoding on GPU 10m
  
  Speaker: Gabriele Cimador (Universita e INFN Trieste (IT))
  Global Parameter Optimisation
  
  Input dataset simulation
  
  Simulated several timeframes:
  
  pp: 100kHz, 200kHz, 500kHz, 1MHz, 2MHz
  
  PbPb: 10kHz, 15kHz, 20kHz, 27kHz, 35kHz, 42kHz, 47kHz, 50kHz
  
  Every timeframe simulated twice, one for 32 orbits timeframe and one for 128 orbits timeframe
  
  For the moment just one simulation per configuration (beam type - interaction rate - timeframe length)
  
  GPU Parameters study
  
  Focusing on grid and block size. Analysed the GPU workflow of the sync/async TPC processing. Image below is the workflow of two HIP streams of the sync TPC processing:
  
  By looking at the tracefile:
  
  Clusterizer chain:
  
  small concurrent kernels
  
  overlap during execution
  
  overall taking considerable time
  
  --> dependent parameters, global optimisation
  
  SliceTracker chain:
  
  medium concurrent kernels
  
  all streams used
  
  main kernel is TrackletConstructor
  
  trace file outputs that CreateSliceData takes a lot of time, however --debug does not say so, still investigating
  
  trace file outputs "Marker" which is not present in nvidia trace files, still investigating
  
  --> dependent parameters, global optimisation
  
  Merger chain:
  
  MergeBorders_step2: lots of small concurrent kernels, concurrent to a limited set of other one stream kernels --> dependent parameters, global optimisation (within set)
  
  SliceRefit: lots of small one stream kernels --> independent parameters, local optimisation
  
  MergerTrackFit: one stream long kernel --> independent parameters, local optimisation (maybe limited since values dependent also on number of tracks)
  
  MergerFollowLoopers: one stream medium kernel --> independent parameters, local optimisation
  
  Compression/Decompression chain:
  
  One stream kernels --> independent parameters, local optimisation
  
  Multiple stream kernels, not overlapping --> independent parameters, local optimisation
  
  Optimisation strategy
  
  For the moment just a "Manual Trial-and-Error" using observations from the output
  
  Started from MergerTrackFit, why:
  
  Long kernel
  
  One stream
  
  Not concurrent to any other kernels
  
  Caveat: grid size dependent on number of tracks
  
  Changing values in GPUDefGPUParameters.h takes a loooong time to compile, even with standalone benchmark
  
  Currently forcing custom krnlExec object in kernel calls, e.g.:
  
  runKernel<KernelClass, KernelClass::step>({{n_blocks,n_threads,stream}});
  
  Not handy, but way faster
  
  Created script that automatically fetches grid and block size for all the kernels, useful for runtime grid/block numbers like GetGrid(Merger.NOutputTracks(), 0, deviceType)
  
  Possible bug spotted
  
  HIP_AMDGPUTARGET set to "default" in GPU/GPUTracking/Standalone/cmake/config.cmake translates in HIP_AMDGPUTARGET=gfx906;gfx908 and forces to use MI50 params
  
  Basically here HIP_AMDGPUTARGET=gfx906;gfx908 enters the first if clause for MI50 even if I am compiling for MI100. Commented set(HIP_AMDGPUTARGET "default") on the config.cmake of the standalone benchmark and forced usage of MI100 parameters via
  
  cmake -DCMAKE_INSTALL_PREFIX=../ -DHIP_AMDGPUTARGET="gfx908" ~/alice/O2/GPU/GPUTracking/Standalone/
  
  Did not investigate further on this.
  
  Possible ideas for post manual optimization
  
  Isolate the parameters which are dependent, i.e. kernels from the same task which run in parallel (e.g. Clusterizer chain)
  
  Apply known optimization techniques to such kernel groups
  
  Grid/random search
  
  Bayesian optimization?
  See: F.-J. Willemsen, R. Van Nieuwpoort, and B. Van Werkhoven, “Bayesian Optimization for auto-tuning GPU kernels”, International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) at Supercomputing (SC21), 2021. Available: https://arxiv.org/abs/2111.14991
- 10:45 → 10:55
  
  Efficient Data Structures 10m
  
  Speaker: Dr Oliver Gregor Rietmann (CERN)

Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY

Global Parameter Optimisation

Input dataset simulation

GPU Parameters study

Optimisation strategy

Possible bug spotted

Possible ideas for post manual optimization