ALICE Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 10:00 → 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate the short GPU stall problem.
      • The limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • The test of the new O2DPG was set up incorrectly; now it seems to work. Catalin wants to do some extra checks before merging.
      • Test with GPU GRID jobs at NERSC pending.
      • Asked DPG to run a first test with ITS tracking on GPU on the EPNs.

       

      GPU ROCm / compiler topics:

      • Issues that disappeared but are not yet understood: random server reboot with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when the DMA engine is turned off, MI100 stalling with ROCm 5.5.
      • Problem with building ONNXRuntime with MIGraphX support.
      • Need to find a way to build ONNXRuntime with support for both CUDA and ROCm.
      • Try to find a better solution for the problem of __device__ inline functions leaking symbols into the host code (see the sketch after this list).
      • LLVM bump to 20.1: status?
      • Miscompilation / internal compiler error fixed in the new clang for ROCm 7.x; the SDMA engine synchronization bug is still not fixed.
        • ROCm 7 with full serialization passes validation in deterministic mode.
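
      For context, a minimal sketch of one conventional mitigation for header-defined inline functions emitting weak symbols in every translation unit (the function name is hypothetical, not O2 code): internal linkage keeps the symbol from escaping the object file. Whether this fits the O2 case is exactly the open question above.

      // Hypothetical example, not the O2 code: an unnamed namespace gives the
      // function internal linkage, so no symbol escapes the translation unit.
      namespace {
      __device__ inline float scaleCharge(float q)
      {
        return 0.5f * q; // placeholder body
      }
      } // namespace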

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels (see the sketch after this list): https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix the bogus TPC transformations for good; then we can revert the workaround.
      • Waiting for TPC to check the PR that uses full cluster errors, including average charge and occupancy-map errors, during seeding.
      • Final solution: merging the transformation maps on the fly into a single flat object. A draft version by Sergey exists but is still WIP.
      • Pending OpenCL2 issues:
        • printf not working due to a confirmed bug in clang; a fix is being prepared. This blocks further debugging for now.
        • Crash in the merger, which can be worked around by disabling the clang SPIR-V optimization. Probably a bug in clang, but printf must be fixed first to debug it.
        • Even with the optimization disabled, it crashes later in TPC merging; printf is needed to debug this as well.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. WIP PR: https://github.com/AliceO2Group/AliceO2/pull/14542
      • Probably a bug in the multi-threaded pipeline when timeframes do not arrive in order; trying to reproduce it.
      • Bug in the reading of MC data; a temporary fix by Ruben has been applied, a proper fix still needs to be implemented.
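
      As a reference for the alignas() idea above, a minimal sketch (the struct is illustrative, not the actual O2 MC label class): fixing the alignment explicitly makes host and device compilers agree on the layout, and static_asserts catch any deviation at compile time.

      // Illustrative struct, not the O2 MC label class: pin the alignment so
      // all compilers (host and device) produce the same layout.
      #include <cstdint>

      struct alignas(8) MCLabel {
        int32_t trackID;
        int16_t eventID;
        int16_t sourceID;
      };

      static_assert(alignof(MCLabel) == 8, "unexpected alignment");
      static_assert(sizeof(MCLabel) == 8, "unexpected size");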

       

      EPN GPU Topics:

       

    • 10:20 → 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Physics

      • Reran on a partial set of digits from https://alimonitor.cern.ch/catalogue/index.jsp?path=%2Falice%2Fsim%2F2025%2FLHC25b8a_v8b3c%2F0%2F559781#/alice/sim/2025/LHC25b8a_v8b3c/0/559781
      • Produced own anchored simulation for the 0-5% centrality simulations (finished last night: 10 simulations with and without SC)

       

      Comparison: Left: old, 50 events, 50 kHz PbPb without proper correction; Right: new simulation from the grid, no centrality enforced (40x statistics)

      (For completeness: the QA plots on the left were made with different thresholds for the NN and GPU CF regressions. This explains the strong difference; it is not related to a "worse" behaviour of the regression net.)

      Plots shown:

      • Clusters: fake fraction
      • Primary tracks: efficiency
      • Primary tracks: fake rate

      New plots (only for the new sim for now):

      • Chi2_red for good tracks
      • Chi2_red for fake tracks
      • Z resolution (improves with the network -> better CoG time estimate)
      • Tracks vs. RowsWClusters
      • Tracks vs. pT (full)
      • Tracks vs. pT (ratio)
      • Ratio of pT RMS (improvement at low pT, otherwise compatible)

      Framework

      • Currently debugging tpcdigits.root writing from the o2-tpc-reco-workflow and the ChunkedDigitWriter. If it still fails by noon, I will switch to custom dumping.

    • 10:25 → 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      HIP/CUDA __launch_bounds__ mismatch problem

      • Almost fixed the mismatch between the HIP and CUDA launch bounds
      • Strategy is to edit the second parameter if the length of the tuple is >= 2
      • At the moment, able to split the strategy and deal with kernels with one parameter
      • Currently stuck on kernels with >= 2 parameters: cannot expand the last macro to inject the launch bounds into the kernel definition (a possible workaround is sketched below).
        E.g., preprocessor output for a kernel with 2 params:
         __attribute__((global)) void GPUCA_KRNL_REG2 1024, 1 krnl_GPUTPCNeighboursFinder( int32_t _iSector_internal ) {...
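
      One possible way around the expansion problem, sketched with hypothetical macro names (not the actual O2 macros): keep the bounds as a parenthesized tuple, so that a function-like macro placed directly before it consumes the tuple as its argument list and the launch bounds expand inside the kernel definition.

      // Hypothetical macros: the parenthesized tuple "(1024, 1)" becomes the
      // argument list of KRNL_REG, which expands to __launch_bounds__(1024, 1).
      #include <cstdint>

      #define KRNL_REG(maxThreads, minBlocks) __launch_bounds__(maxThreads, minBlocks)
      #define DEFINE_KRNL(name, bounds) \
        __global__ void KRNL_REG bounds krnl_##name(int32_t _iSector_internal)

      DEFINE_KRNL(GPUTPCNeighboursFinder, (1024, 1)) { /* kernel body */ }
      // expands to:
      // __global__ void __launch_bounds__(1024, 1)
      //   krnl_GPUTPCNeighboursFinder(int32_t _iSector_internal) { ... }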

      ITS-GPU-tracker parameter tuning

      • Added launch bounds to 6 kernels
      • Original version: nThreads=256, nBlocks=30
      • Changed to nThreads=256, nBlocks=60 to have one block per SM (60 SMs/CUs in total on the MI50)
      • Set __launch_bounds__(256, 1) for those 6 kernels (see the sketch after this list)
      • Processed 1091 pp TFs (alien:///alice/data/2025/LHC25ac/563430/)
      • Measured the following metric from the tracker: "TimeFrame 1091 processing completed in: 300.93 ms using 20 thread(s)"

        Mean TF processing times on AMD MI50:

                                                   No __launch_bounds__   With __launch_bounds__   Gain     Speedup
        Before overlapping memcpy with compute     853.1 ± 191.9 ms       561.2 ± 155.5 ms         34.2 %   1.5
        After overlapping memcpy with compute      651.6 ± 167.2 ms       454.6 ± 137.0 ms         30.2 %   1.4
      • This is with simple manual tuning. If this metric is valid, I will use the tuner to search for the best configurations for the individual kernels.
      • Will have to think about how to deal with different GPU architectures, as these parameters are architecture-dependent (suggestions?)
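
      For reference, a sketch of the manual tuning described above, with a hypothetical kernel (CUDA semantics; in HIP the second __launch_bounds__ argument has a different meaning): cap the block size at 256 threads, ask the compiler to keep at least one block resident per SM/CU, and launch one block per CU.

      // Hypothetical kernel illustrating the tuning: 256 threads per block,
      // at least 1 resident block per SM/CU, launched with one block per CU.
      __global__ __launch_bounds__(256, 1) void exampleKernel(float* data, int n)
      {
        // grid-stride loop: the fixed 60-block grid covers any n
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x) {
          data[i] *= 2.0f; // placeholder for the per-element work
        }
      }
      // launch: exampleKernel<<<60, 256>>>(d_data, n); // 60 CUs on the MI50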

      ALICE contribution to HS23

      Currently on hold; priority is given to the __launch_bounds__ fix and the ITS tuning, and Robin is on vacation anyway.

    • 10:30 → 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      Summer Student Milla Bramsted

      • She is working on benchmarking SoA code on GPUs
      • We track her project in this Google doc
      • She is adding CUDA kernels to this repo
      • 5 CUDA kernels are now running in our benchmark framework
      • They run in both AoS and SoA data layouts (see the sketch after this list)
      • Milla presented her project at the AIP meeting
      • This is her last week
      • Next steps:
        • Finish the written report
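
      To make the AoS vs. SoA comparison concrete, a minimal sketch (the types are hypothetical, not the benchmark code) of the same operation in both layouts: the SoA variant lets consecutive threads read consecutive floats, while the AoS variant reads with a struct-sized stride.

      // Hypothetical types: the same operation in AoS and SoA layout.
      struct HitAoS { float x, y, z; };      // array of structs
      struct HitsSoA { float *x, *y, *z; };  // struct of arrays

      __global__ void shiftAoS(HitAoS* hits, int n, float dx)
      {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) hits[i].x += dx;  // 12-byte stride between threads
      }

      __global__ void shiftSoA(HitsSoA hits, int n, float dx)
      {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) hits.x[i] += dx;  // fully coalesced access
      }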

      ALICE O2 CI-Pipelines on NGT Cluster

      • A fork of the AliceO2 repo is now in the NextGenTriggers (NGT) GitHub organization
      • It has a GitHub action running the standalone benchmark on NGT GPUs
      • Uses the builds of O2 and its dependencies in /cvmfs/alice.cern.ch/ (the pipeline takes about 7 minutes)
      • Different GPUs are tested in parallel on different VMs
      • The O2 standalone benchmark works on all NGT resources:
        • Nvidia H100 188GB NVL
        • AMD Instinct MI300X
        • AMD Radeon Pro W7900
        • Nvidia L40S
      • We are now using custom .par files
      • Next steps
        • Generate optimized .par files with Gabriele
      • Possible next steps
        • Add new architectures to the O2 hardcoded ones?
        • Store results in CSV format and plot them?
        • Display the plots in the GitHub web GUI?

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo
      • Simplified the AoS-SoA wrapper code
      • Started to apply our AoS-SoA code to: 
        • Merger
          • GPUTPCGMSectorTrack
          • GPUTPCGMTrackParam 
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTrack
      • Started with SectorTracker, no problems until now

      ACAT Conference Preparation

      • Presenting our joint work with Jolly on AoS vs. SoA.
      • Added some manual AoS benchmarks as baselines.
    • 10:35 → 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))

      Build the O2 GPU TPC standalone benchmark on two local machines:

      Motivation: want to reduce my use of the EPNs for routine activities like generating PRs and learning the framework.

       

      On my local workstation: copied the datasets generated by David from the EPN machine, but the benchmark shows a size mismatch error.
       
       
      vikas@vsinghal:/localdata/standalone$ ./ca -e o2-pp-10 --debug 1
      Reading events from Directory o2-pp-10
      Created GPUReconstruction instance for device type CPU (1)
      ERROR reading events/o2-pp-10/tpctransform.dump, invalid size: 4552 (4568 expected)
      terminate called after throwing an instance of 'std::runtime_error'
        what():  invalid size
      Aborted
      vikas@vsinghal:/localdata/standalone$ ./ca -e o2-pbpb-50 --debug 1
      Reading events from Directory o2-pbpb-50
      Created GPUReconstruction instance for device type CPU (1)
      ERROR reading events/o2-pbpb-50/tpctransform.dump, invalid size: 4552 (4568 expected)
      terminate called after throwing an instance of 'std::runtime_error'
        what():  invalid size
      Aborted
      vikas@vsinghal:/localdata/standalone$

       

      Checked the dump files with md5sum; there is no difference.
      Checked the OS versions on the EPN and on my setups. I have two setups, one with Debian and one with AlmaLinux, but it did not work on either. (Since the md5sums match, the files themselves are intact; the mismatch of 4552 vs. 4568 bytes may instead come from building the benchmark from a different O2 version than the one that wrote the dumps.)

       

      vikas@vsinghal:/localdata/standalone/events/o2-pp-10$ diff md5EpnO2-pp-10 md5VikasO2-pp-10 
      vikas@vsinghal:/localdata/standalone/events/o2-pp-10$
      
      [vsinghal@epn000 o2-pp-10]$ lsb_release -a
      LSB Version:    :core-4.1-amd64:core-4.1-noarch
      Distributor ID:    AlmaLinux
      Description:    AlmaLinux release 9.5 (Teal Serval)
      Release:    9.5
      Codename:    TealServal
      [vsinghal@epn000 o2-pp-10]$
      
      vikas@vsinghal:/localdata/standalone/events/o2-pp-10$ lsb_release -a
      No LSB modules are available.
      Distributor ID:    Debian
      Description:    Debian GNU/Linux 11 (bullseye)
      Release:    11
      Codename:    bullseye
      vikas@vsinghal:/localdata/standalone/events/o2-pp-10$ 
      
      trainee@gpu-compute:~/a_standalone/events/o2-pp-10$ lsb_release -a
      LSB Version:    n/a
      Distributor ID:    AlmaLinux
      Description:    AlmaLinux 9.6 (Sage Margay)
      Release:    9.6
      Codename:    n/a
      trainee@gpu-compute:~/a_standalone/events/o2-pp-10$

       

      Tried to generate a dataset myself, but this needs O2sim, CCDB access, alien-token-init, etc.
      Is there a way to use the EPN datasets, or some other method?

      vikas@vsinghal:/localdata/standalone$ ~/AliceGPU/sw/SOURCES/O2/daily-20250808-0000/daily-20250808-0000/prodtests/full_system_test.sh 
      Missing O2sim environment
      vikas@vsinghal:/localdata/standalone$ alienv enter O2sim/latest
      ERROR: O2sim/latest was not found
      vikas@vsinghal:/localdata/standalone$

       

    • 10:40 → 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))

      OpenCL

      Open issues with PoCL:

      • Memory corruption in the ZS decoder
      • Crash when compiling the stream compaction kernel
      • New: crash in the ZS decoder
        • Somehow missed this last week; it does not happen when OpenCL optimizations are disabled

      Looking into the crash in stream compaction -> a small kernel, but not straightforward to reproduce; it seems to require the identical layout as the kernel in O2. (For context, a generic stream-compaction sketch follows.)
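
      A generic illustration of what such a stream-compaction kernel does (written here in CUDA; it is not the O2 kernel, whose layout and implementation differ): keep the elements that pass a predicate and write them contiguously to the output.

      // Generic stream compaction, not the O2 kernel: an atomic counter hands
      // out output slots for the elements that pass the predicate.
      // Note: the order of the kept elements is not preserved.
      __global__ void compact(const int* in, int* out, int* outCount, int n)
      {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] > 0) {            // predicate: keep positive values
          int pos = atomicAdd(outCount, 1);  // claim the next output slot
          out[pos] = in[i];
        }
      }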

      GPU Servers

      Possible configurations that would allow buying a 64-core Threadripper:

      1. Stripped-down components (less storage, 5070 Ti instead of 5080 for the dev machine)
      2. Get a Ryzen 9950X instead as the CPU for the CI machine
        • Only 1 mainboard available with an AM5 socket, IPMI, and 2 GPU slots
        • The mainboard has an expected delivery time of 2-3 months...
      3. Reuse alibi after all (currently has an RTX 2080 + Radeon VII)
        • The machine needs at least an OS upgrade anyway
        • Buy a new (regular 2U) server for simulation jobs instead

      Spreadsheet of all components + cost: https://docs.google.com/spreadsheets/d/1CcPUBvk4QVq344NOnXja-OjEBX1mm0_sFebN-5t1iz0

      (TODO: update spreadsheet for option 3)

    • 10:45 → 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
    • 10:50 → 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)