Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
Useful links
Join via phone
Zoom URL
    • 10:00 AM – 10:20 AM
      Discussion 20m
      Speakers: David Rohr (CERN), Giulio Eulisse (CERN)
    • 10:20 AM – 10:25 AM
      Following up JIRA tickets 5m
      Speaker: Ernst Hellbar (CERN)
    • 10:25 AM – 10:30 AM
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      New developments:

      • Added local occupancy estimator (this was one request by GSI in the group meeting):
        // Pseudocode: fraction of occupied pad-time bins within +-20 time bins,
        // over all pads of the ROC; 41 = window width (2 * 20 + 1).
        vector<float> loc_occ(all(pad_time_bins), 0);
        for (pad_time_bin in tpc_sector) {
            for (pad_time_bin_2 in [time(pad_time_bin) - 20, time(pad_time_bin) + 20] x all_pads_in_ROC(pad_time_bin)) {
                loc_occ[pad_time_bin] += (int)(pad_time_bin_2 > 0);
            }
            loc_occ[pad_time_bin] /= (all_pads_in_ROC(pad_time_bin) * 41);
        }
      • Distortion simulation working now: full-system-test, distortion type 2 + slow-realistic-full-sim + runnumber=311000
      • Got ITS-TPC matching + QC to work using the reco_NOGPU.log (WORKFLOWMODE="print")
      • Simulated different IRs and collision systems (PbPb: 50 kHz, 30 kHz, 15 kHz; pp: 1500 kHz, 1000 kHz, 500 kHz)
      • Created dedicated dataset with loopers: 1 event Pb-Pb 500 kHz + lowneut=true -> many loopers at high time bins
      • Created dedicated simulation for overlap training data: 5 events Pb-Pb 200 kHz should give a high local occupancy and many overlapping tracks (to be seen if it is locally realistic)
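      The local occupancy estimator above can be sketched as runnable NumPy code. This assumes a dense (pads × time bins) charge array for a single ROC; the array layout and function name are illustrative, not taken from the actual O2 code:

```python
import numpy as np

def local_occupancy(charges, half_window=20):
    # charges: 2D array (n_pads, n_time) of ADC values for one ROC.
    n_pads, n_time = charges.shape
    occupied = (charges > 0).astype(np.float32)
    per_time = occupied.sum(axis=0)           # occupied bins per time bin, over all pads
    window = 2 * half_window + 1              # the "41" in the pseudocode
    padded = np.pad(per_time, half_window)    # bins outside the range count as empty
    counts = np.convolve(padded, np.ones(window), mode="valid")
    loc_occ = counts / (n_pads * window)      # normalise by pads * window width
    # Same value for every pad at a given time bin.
    return np.broadcast_to(loc_occ, (n_pads, n_time))
```

      The sliding-window sum replaces the explicit double loop of the pseudocode; dividing by the constant window width matches the pseudocode's fixed factor of 41 even at the time-range edges.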


      Investigated space-charge-distorted data.

      • Tested NNs trained on non-space-charge-distorted data to see if they are good enough or need retraining.
      • Tested different input sizes and (NN classification + NN regression) vs. (NN classification + GPU CF regression) vs. (pure GPU CF regression)
      • Overall results:
        • NN classification + NN regression:
          • Regression does not work well; bands are significantly broader or look like noise. Needs retraining, which was basically expected.
          • When applying all corrections, charge offset of native clusterizer does not appear anymore (investigating)
        • NN classification + GPU CF:
          • Similar effects as in non-distorted data: ~10-20% reduction of clusters. Bands look virtually identical for all variables (CoG pad and time, sigma, etc.) between both algorithms -> So classification seems to be somewhat robust to SC distortion


      Plan ahead

      • Retrain NNs with SC-distorted data and perform QA again
      • Streamline everything further. Currently many manual steps -> focus on automation
    • 10:30 AM – 10:35 AM
      ITS Tracking 5m
      Speaker: Matteo Concas (CERN)
    • 10:35 AM – 10:45 AM
      TPC Track Model Decoding on GPU 10m
      Speaker: Gabriele Cimador (Universita e INFN Trieste (IT))

      GPU param optimisation

      Setup

      Measured on Alma 9.4, ROCm 6.3.1, MI50 GPU

      Executed grid search for the following kernels:

      • MergerTrackFit
      • MergerFollowLoopers
      • MergerSliceRefit
      • MergerCollect
      • CompressionKernels_step0attached
      • CompressionKernels_step1unattached

      These are the longest single-stream kernels. Their parameters are independent, so they are easier to optimise. A custom search space is defined for every kernel (some cannot run with large block sizes).

      Each mean time is normalised to the mean time of the current (block_size, grid_size) configuration, so < 1 means a better configuration, > 1 means worse, and = 1 means equal performance to the current one.
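      The normalisation can be sketched as follows; the timings here are randomly generated placeholders, not actual measurements, and the (512, 60) current configuration is borrowed from the MergerCollect example below:

```python
import numpy as np

block_sizes = [64, 128, 256, 512]
grid_sizes = [60, 240, 960]

# Placeholder mean kernel times (ms) per (block_size, grid_size) point.
rng = np.random.default_rng(0)
mean_times = rng.uniform(0.8, 1.6, size=(len(block_sizes), len(grid_sizes)))

# Normalise to the current configuration, here (512, 60).
current = mean_times[block_sizes.index(512), grid_sizes.index(60)]
heatmap = mean_times / current  # < 1 faster than current, > 1 slower, 1 equal
```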

      MergerTrackFit

      Executed twice (Merger 1 and Merger 2)

      pp

      Merger 1

      • Low IR: same performance as the normal configuration (grid size dependent on number of tracks)
      • High IR: same as low IR, except for (64, 240), which also matches the normal configuration

      Merger 2

      • Low and High IR sync benefits from bigger grid sizes
      • High IR async is 34% faster with higher grid sizes than current configuration for async

      PbPb

      Merger 1

      • Larger grid sizes almost reach the current configuration (grid_size * block_size >= n_tracks)

      Merger 2

      • Low IR can be 10% faster with bigger grid sizes
      • High IR is 40% faster with bigger grid sizes

      MergerSliceRefit

      Kernel is executed 36 times (once per TPC sector).

      • pp low IR benefits from lower block sizes
      • pp high IR benefits from larger grid and block sizes
      • PbPb low IR better with lower block sizes
      • PbPb high IR better with larger grid and block sizes

      MergerCollect

      pp

      Some measurements must be retaken due to unknown problems. Overall best performance is given by (64, 960), while the current configuration is (512, 60).

      PbPb

      Roughly same as pp

      MergerFollowLoopers

      Best configuration uses 900 or 960 as grid size. Current configuration is (256,200).

      Compression kernels

      Step 0 attached clusters

      No significant improvements when changing grid and block sizes.

      Step 1 unattached clusters

      For high IR, (192, 180) shows better performance than the current configuration (512, 120).

      Grid search script

      Since these kernels are not executed concurrently, their parameters are independent. Hence, a Python script has been created to perform multiple grid searches at once:

      1. A custom grid search space is defined for each kernel
      2. At each iteration, take a new point, i.e. (block_size, grid_size), from each search space (skip spaces that have been completely explored)
        1. Modify (automatically) the code, plugging each new configuration into the corresponding kernel call in O2
        2. Compile
        3. Execute and measure kernel timings
      3. Iterate until the largest search space is exhausted

      Pros: Multiple grid searches possible per single run

      Cons: Works effectively only with non-concurrent kernels
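      The loop described above might look roughly like this; the search-space values are illustrative and the apply_config/build/measure hooks are hypothetical placeholders for the actual O2 patching, compilation, and timing steps:

```python
def run_grid_search(search_spaces, apply_config, build, measure):
    """One compile+run per iteration tries one new point from *each* kernel's
    search space; a space stops contributing once completely explored."""
    n_iters = max(len(space) for space in search_spaces.values())
    results = {kernel: {} for kernel in search_spaces}
    for i in range(n_iters):
        # Pick the i-th point from every search space that still has one.
        chosen = {k: space[i] for k, space in search_spaces.items() if i < len(space)}
        for kernel, (block_size, grid_size) in chosen.items():
            apply_config(kernel, block_size, grid_size)  # patch the kernel call in O2
        build()                                          # recompile
        timings = measure()                              # run and time the kernels
        for kernel, cfg in chosen.items():
            results[kernel][cfg] = timings[kernel]
    return results

# Illustrative search spaces (not the actual ones used).
search_spaces = {
    "MergerTrackFit": [(64, 240), (128, 480), (256, 960)],
    "MergerCollect": [(64, 960), (256, 480)],
}
```

      Because one compile+run covers one new point from every search space, the total number of builds is driven by the largest space rather than by the product of all spaces, which is what makes multiple grid searches per run possible.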

      Next things to do

      • Try to perform grid search on long kernels with other concurrent kernels
      • Determine a way to assess which configuration is best after the grid search (instead of just looking at the heatmaps)
      • Create a set of optimum parameters based on beamtype and IR
      • Explore best parameters with other IRs
      • Explore if best parameters change for different datasets with same beamtype and IR
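      For the "assess which configuration is best" item, a trivial automatic criterion could be to pick the point with the smallest normalised mean time. This is only a sketch; a real criterion would need to handle measurement noise, e.g. by requiring a clear margin below 1.0:

```python
def best_configuration(normalised_times):
    # normalised_times: {(block_size, grid_size): mean time relative to current};
    # the smallest value is the fastest configuration found.
    return min(normalised_times, key=normalised_times.get)

# Illustrative normalised timings, not measurements.
times = {(64, 960): 0.82, (512, 60): 1.00, (192, 180): 0.91}
```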
    • 10:45 AM – 10:55 AM
      Efficient Data Structures 10m
      Speaker: Dr Oliver Gregor Rietmann (CERN)