Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
Useful links
Join via phone
Zoom URL
    • 12:45 13:05
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (criticalnews from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for pp workflow is now the TPC time series, which is to slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
      • Will tune existing 16-core settings, add a SITEARCH for 16core CPU, and 16coreCPU + generic NVIDIA / AMD GPU, like for 8 core.
      • Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Need to check ROCm 7.2 corrtecness.
      • Need to understand deterministic mode issue on AMD Pro 9700 reported by Oliver - Status?

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix alignment of monte carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
      • Check for unnecessary f64 instructions in GPU code.

       

      Other topics:

      • Need to bump ONNXRuntime to 1.24, Giulio is checking, needed for ROCm 7.2 - Status?
      • Test at NERSC successfull, first results presented at CHEP and LHCP.
        • GPU TPC + ITS tracking working nicely on A100 GPU, no backpressure from GPU, but workflow is CPU-bound, as on EPNs.
        • For some reasons, the GPUs at NERSC perform much slower than on EPNs (even though both are 64 cores), should be investigated.

       

      EPN GPU Topics:

       

    • 13:05 13:10
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
    • 13:10 13:15
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      Slightly changed strategy for incorrect kernel configurations:

      1. Query GPU constants and limits:
        warp size, max threads / SM, registers / SM, shared mem / SM  & block, max blocks / SM
      2. Define parameters range:
        Block size: typically from warp size to 20 * warp size
        Min blocks/SM: sampled as fraction of SM capacity ∈ [0, 1]
      3. Loop:
        1. Sample parameters from Optuna study
        2. RT-compile with sampled parameters; min_blocks_per_sm = 1
        3. Inspect fatbins files to query kernels resource usage:
          Function krnl_GPUTPCTrackletSelector:
            REG:64 STACK:600 SHARED:8976 LOCAL:0 CONSTANT[0]:900 TEXTURE:0 SURFACE:0 SAMPLER:0
        4. Compute per kernel max_blocks_per_sm based on resource constraints
        5. RT-compile with min_blocks_per_sm = sm_fraction × max_blocks_per_sm
        6. Measure tuned step duration and report result to Optuna study

       

      This way ensures always good configurations

      Except when block size exceeds limits due to static shared memory dependent on it, example:

      class GPUTPCTrackletSelector : public GPUKernelTemplate
      {
       public:
        struct GPUSharedMemory {
          int32_t mItr0;          // index of the first track in the block
          int32_t mNThreadsTotal; // total n threads
          int32_t mNTracklets;    // n of tracklets
          int32_t mReserved;      // for alignment reasons
          static_assert(GPUTPCGeometry::NROWS >= GPUCA_PAR_TRACKLET_SELECTOR_HITS_REG_SIZE);
          GPUTPCHitId mHits[GPUCA_PAR_TRACKLET_SELECTOR_HITS_REG_SIZE][GPUCA_GET_THREAD_COUNT(GPUCA_LB_GPUTPCTrackletSelector)];
        };

      Thus if the first RT compilation fails:

      1. Logs from compilation are inspected
      2. Failing kernels identified
      3. Steps with failing kernels are marked as bad, RT again without those steps
      4. Update per kernel cache with max block size that can be sampled

       

      SUCCESS: Tuner now finds optimal sets within GPU limits

    • 13:15 13:20
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)
       

      NextGenTrigger Task 1.7

      • CHEP talk together with Jolly got accepted.
      • Co-Supervision of a summer student. Topic: Imrpove clustering algorithm CLUE.

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended CI-pipline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
          • GPUTPCTrack
      • Added GPU benchmarks to the repo of synthetic benchmarks for our SoA library.
      • Implemented reflections approach in O2 standalone benchmark.
      • Prepared CHEP 2026 presentation.
      • Next Steps:
        • Make better use of SoA to improve performance
        • Try David's suggestion
    • 13:20 13:25
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 13:25 13:30
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)

      OpenCL

      No news.

      GPU Servers

      CI Server: Sergio finished CI integration.

      Looking into setting up workflow to run standalone GPU tool.

      Need new job variant ci-jobs? Repo is private.

      Highly Ionizing Particles

      Biggest changes since last meeting:

      • Fixed issue where filter missed saturated bins (-> All bins with ADC == 1023 are now cleared)
      • Added rising edge detection
      • Added exponential filter for tail cutoff
      • Cluster creation is now working

      Filter Performance

      Rough parameter sweep over threshold and exp filter alpha:

      • Performance increased to 60% - 70% depending on tolerated false positive rate.
      • Filter performance looks better when checking #ADC instead of #digit cleared. -> Reach +80% of ADCs cleared 
      • Caveats: cluster pad width appears to be a fat tail distribution. Filter currently goes +- 5 pads, but ~10% digits and ADC outside of that window and never seen by filter (=> SOL is 90% for filter performance)
      • Low alpha (slow update rate): Higher recall, but false positive rate explodes

       

       

      Cluster Creation

      Short example for cluster creation. Also shows failure modes of HIP filter.

      MC:

      Cluster + MC:

      Runtime

      Cluster creation currently parallel across rows. Still need some idea how to parallelize across tails, or at least make tail -> cluster matching faster single threaded.

    • 13:30 13:35
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
    • 13:35 13:40
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)