Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
Useful links
Join via phone
Zoom URL
    • 10:00 AM – 10:20 AM
      Discussion 20m
      Speakers: David Rohr (CERN), Giulio Eulisse (CERN)
    • 10:20 AM – 10:25 AM
      Following up JIRA tickets 5m
      Speaker: Ernst Hellbar (CERN)
    • 10:25 AM – 10:30 AM
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      New developments:

      • Added local occupancy estimator (this was one request by GSI in the group meeting):
        // Pseudocode: fraction of occupied pad-time bins within +-20 time bins,
        // over all pads of the ROC; 41 = window width (2 * 20 + 1).
        vector<float> loc_occ(all(pad_time_bins), 0);
        for (pad_time_bin in tpc_sector) {
            for (pad_time_bin_2 in [time(pad_time_bin) - 20, time(pad_time_bin) + 20] x all_pads_in_ROC(pad_time_bin)) {
                loc_occ[pad_time_bin] += (int)(pad_time_bin_2 > 0);
            }
            loc_occ[pad_time_bin] /= (all_pads_in_ROC(pad_time_bin) * 41);
        }
      • Distortion simulation working now: full-system-test, distortion type 2 + slow-realistic-full-sim + runnumber=311000
      • Got ITS-TPC matching + QC to work using the reco_NOGPU.log (WORKFLOWMODE="print")
      • Simulated different IRs and collision systems (PbPb: 50 kHz, 30 kHz, 15 kHz; pp: 1500 kHz, 1000 kHz, 500 kHz)
      • Created dedicated dataset with loopers: 1 event Pb-Pb 500 kHz + lowneut=true -> many loopers at high time bins
      • Created dedicated simulation for overlap training data: 5 events Pb-Pb 200 kHz should give a high local occupancy and many overlapping tracks (to be seen if it is locally realistic)
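      The local occupancy estimator above can be sketched as runnable NumPy code. This assumes a dense (pads × time bins) charge array for a single ROC; the array layout and function name are illustrative, not taken from the actual O2 code:

```python
import numpy as np

def local_occupancy(charges, half_window=20):
    # charges: 2D array (n_pads, n_time) of ADC values for one ROC.
    n_pads, n_time = charges.shape
    occupied = (charges > 0).astype(np.float32)
    per_time = occupied.sum(axis=0)           # occupied bins per time bin, over all pads
    window = 2 * half_window + 1              # the "41" in the pseudocode
    padded = np.pad(per_time, half_window)    # bins outside the range count as empty
    counts = np.convolve(padded, np.ones(window), mode="valid")
    loc_occ = counts / (n_pads * window)      # normalise by pads * window width
    # Same value for every pad at a given time bin.
    return np.broadcast_to(loc_occ, (n_pads, n_time))
```

      The sliding-window sum replaces the explicit double loop of the pseudocode; dividing by the constant window width matches the pseudocode's fixed factor of 41 even at the time-range edges.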


      Investigated space-charge-distorted data.

      • Tested NNs trained on non-space-charge-distorted data to see if they are good enough or need retraining.
      • Tested different input sizes and (NN classification + NN regression) vs. (NN classification + GPU CF regression) vs. (pure GPU CF regression)
      • Overall results:
        • NN classification + NN regression:
          • Regression does not work well; bands are significantly broader or look like noise. Needs retraining, which was basically expected.
          • When applying all corrections, charge offset of native clusterizer does not appear anymore (investigating)
        • NN classification + GPU CF:
          • Similar effects as in non-distorted data: ~10-20% reduction of clusters. Bands look virtually identical for all variables (CoG pad and time, sigma, etc.) between both algorithms -> So classification seems to be somewhat robust to SC distortion


      Plan ahead

      • Retrain NNs with SC-distorted data and perform QA again
      • Streamline everything further. Currently many manual steps -> focus on automation
    • 10:30 AM – 10:35 AM
      ITS Tracking 5m
      Speaker: Matteo Concas (CERN)
    • 10:35 AM – 10:45 AM
      TPC Track Model Decoding on GPU 10m
      Speaker: Gabriele Cimador (Universita e INFN Trieste (IT))

      GPU param optimisation

      Setup

      Measured on Alma 9.4, ROCm 6.3.1, MI50 GPU

      Executed grid search for the following kernels:

      • MergerTrackFit
      • MergerFollowLoopers
      • MergerSliceRefit
      • MergerCollect
      • CompressionKernels_step0attached
      • CompressionKernels_step1unattached

      These are the longest single-stream kernels. Their parameters are independent, so they are easier to optimise. A custom search space is defined for every kernel (some cannot run with large block sizes).

      Each mean time is normalised to the mean time of the current (block_size, grid_size) configuration, so < 1 means a better configuration, > 1 means worse, and = 1 means equal performance to the current one.
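      The normalisation can be sketched as follows; the timings here are randomly generated placeholders, not actual measurements, and the (512, 60) current configuration is borrowed from the MergerCollect example below:

```python
import numpy as np

block_sizes = [64, 128, 256, 512]
grid_sizes = [60, 240, 960]

# Placeholder mean kernel times (ms) per (block_size, grid_size) point.
rng = np.random.default_rng(0)
mean_times = rng.uniform(0.8, 1.6, size=(len(block_sizes), len(grid_sizes)))

# Normalise to the current configuration, here (512, 60).
current = mean_times[block_sizes.index(512), grid_sizes.index(60)]
heatmap = mean_times / current  # < 1 faster than current, > 1 slower, 1 equal
```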

      MergerTrackFit

      Executed twice (Merger 1 and Merger 2)

      pp

      Merger 1

      • Low IR: same performance as the normal configuration (grid size dependent on number of tracks)
      • High IR: same as low IR, except for (64, 240), which also matches the normal configuration

      Merger 2

      • Low and High IR sync benefits from bigger grid sizes
      • High IR async is 34% faster with higher grid sizes than current configuration for async

      PbPb

      Merger 1

      • Larger grid sizes almost reach the current configuration (grid_size * block_size >= n_tracks)

      Merger 2

      • Low IR can be 10% faster with bigger grid sizes
      • High IR is 40% faster with bigger grid sizes

      MergerSliceRefit

      Kernel is executed 36 times (once per TPC sector).

      • pp low IR benefits from lower block sizes
      • pp high IR benefits from larger grid and block sizes
      • PbPb low IR better with lower block sizes
      • PbPb high IR better with larger grid and block sizes

      MergerCollect

      pp

      Some measurements must be retaken due to unknown problems. Overall best performance is given by (64, 960), while the current configuration is (512, 60).

      PbPb

      Roughly same as pp

      MergerFollowLoopers

      Best configuration uses 900 or 960 as grid size. Current configuration is (256,200).

      Compression kernels

      Step 0 attached clusters

      No significant improvements when changing grid and block sizes.

      Step 1 unattached clusters

      For high IR, (192, 180) shows better performance than the current configuration (512, 120).

      Grid search script

      Since these kernels are not executed concurrently, their parameters are independent. Hence, a Python script has been created to perform multiple grid searches at once:

      1. A custom grid search space is defined for each kernel
      2. At each iteration, take a new point, i.e. (block_size, grid_size), from each search space (skip spaces that have been completely explored)
        1. Modify (automatically) the code, plugging each new configuration into the corresponding kernel call in O2
        2. Compile
        3. Execute and measure kernel timings
      3. Iterate until the largest search space is exhausted

      Pros: Multiple grid searches possible per single run

      Cons: Works effectively only with non-concurrent kernels
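      The loop described above might look roughly like this; the search-space values are illustrative and the apply_config/build/measure hooks are hypothetical placeholders for the actual O2 patching, compilation, and timing steps:

```python
def run_grid_search(search_spaces, apply_config, build, measure):
    """One compile+run per iteration tries one new point from *each* kernel's
    search space; a space stops contributing once completely explored."""
    n_iters = max(len(space) for space in search_spaces.values())
    results = {kernel: {} for kernel in search_spaces}
    for i in range(n_iters):
        # Pick the i-th point from every search space that still has one.
        chosen = {k: space[i] for k, space in search_spaces.items() if i < len(space)}
        for kernel, (block_size, grid_size) in chosen.items():
            apply_config(kernel, block_size, grid_size)  # patch the kernel call in O2
        build()                                          # recompile
        timings = measure()                              # run and time the kernels
        for kernel, cfg in chosen.items():
            results[kernel][cfg] = timings[kernel]
    return results

# Illustrative search spaces (not the actual ones used).
search_spaces = {
    "MergerTrackFit": [(64, 240), (128, 480), (256, 960)],
    "MergerCollect": [(64, 960), (256, 480)],
}
```

      Because one compile+run covers one new point from every search space, the total number of builds is driven by the largest space rather than by the product of all spaces, which is what makes multiple grid searches per run possible.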

      Next things to do

      • Try to perform grid search on long kernels with other concurrent kernels
      • Determine a way to assess which configuration is best after the grid search (instead of just looking at the heatmaps)
      • Create a set of optimum parameters based on beamtype and IR
      • Explore best parameters with other IRs
      • Explore if best parameters change for different datasets with same beamtype and IR
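      For the "assess which configuration is best" item, a trivial automatic criterion could be to pick the point with the smallest normalised mean time. This is only a sketch; a real criterion would need to handle measurement noise, e.g. by requiring a clear margin below 1.0:

```python
def best_configuration(normalised_times):
    # normalised_times: {(block_size, grid_size): mean time relative to current};
    # the smallest value is the fastest configuration found.
    return min(normalised_times, key=normalised_times.get)

# Illustrative normalised timings, not measurements.
times = {(64, 960): 0.82, (512, 60): 1.00, (192, 180): 0.91}
```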
    • 10:45 AM – 10:55 AM
      Efficient Data Structures 10m
      Speaker: Dr Oliver Gregor Rietmann (CERN)