Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
Useful links
Join via phone
Zoom URL
    • 10:00 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (criticalnews from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for pp workflow is now the TPC time series, which is to slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
      • New GPU architecture selection for async in O2DPG looks good, should be merged.
      • Test with GPU GRID jobs at NERSC pending.
      • Asked DPG to run first test with ITS tracking on GPU on EPNs.

       

      GPU ROCm / compiler topics:

      • Issues that disappeared but not yet understood: random server reboot with alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when DMA engine turned off, MI100 stalling with ROCm 5.5.
      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • LLVM Bump to 20.1: status?
      • Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
        • Compilation errors fixed in ROCm trunk, to go into ROCm 7.
        • Validated standalone benchmark in deterministic mode, and ran FST.
        • Performance regression of ~1.5% without RTC, no regression with RTC. Reported to AMD. Not clear if this will be fixed.
        • Serialization bug on MI100 still not fixed, still requires the workaround.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix alignment of monte carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Compilation issues on GPU fixed.
        • Matthias did some checks, found two minor bugs to be fixed by Sergey.
        • Speed seems ok, 0.25s for merging 2 maps.
      • Pending OpenCL2 issues:
        • printf not working due to confirmed bug in clang, fix is being prepared. Prevents further debugging for now.
        • Crash in merger, which can be worked around by disabling clang SPIRV optimization. Probably bug in clang, but need to fix printf first to debug.
        • Also with optimization disabled, crashing later in TPC merging, need printf to debug.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. WIP PR: https://github.com/AliceO2Group/AliceO2/pull/14542
        • Now has general improvements in all aspects: higher efficiency, lower clone rate, more clusters rejected online, better cluster to track association, faster on AMD GPUs (though slightly slower on NVIDIA and CPU).
        • Trying to understand one problem found by Ruben where it gets stuck processing a TF, then we might merge it to benefit from the improevements.
        • Further developments in different branch on top.
      • Bug in multi-threaded pipeline when timeframes do not arrive in order. Happened again, Ernst and Giulio obtained a stack trace and attached a gdb. Not fully clear why, but the completion policy is stuck. Tentative fix for a possible deadlook between the completion policy and DPL is here https://github.com/AliceO2Group/AliceO2/pull/14640. Need to test online if the problem goes away.
      • gpu-reconstruction quitting with error in some async jobs due to running out of buffers, need to check but probably bad TPC data.

       

      EPN GPU Topics:

       

    • 10:20 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Online runs - summary

      • Tested multiple configurations
      • Almost all failed
        • Identified reasons (out-of-bound writes due to more thread-spawning than requested), fixed and PR merged yesterday: https://github.com/AliceO2Group/AliceO2/pull/14657

       

      Physics: ITS-TPC matching

      • First investigation on its-tpc debug workflow was a bit inconclusive
      • Second investigation on debug trees leads to more results:
        A = Simulation with CF regression, B = simulation with regression network (should show matching losses)
      • Shown: ITS-TPC matched tracks found in one simulation but not in the other. The plotted parameters will then be from the first simulation.
        (selected by the same isPhysicalPrimary as the histograms that show the losses)

       

      Reminder: Major losses observed in 0.2 < | q/pT | < 2, ...

      ... as found in the debug trees:

      Z:

      sin(phi):

      tan(lambda):

      Reason why I didn't see them in the eta distribution: Window of the plot was too small.

      nClustersTPC:

      So many of the lost tracks are at NCL -> 60. How about the ones with NCL > 100?

      Let's have a look at the inverse situation: ITS-TPC matches found in B but not in A:

      -> NCL > 100 similar here (so maybe its just an MC label mismatch: {track ID, event ID, source ID})

       

      Finally, the loss is found:

       

      So the matching efficiency loss is now understood and comes from the high inclination region. Q: Anything to do about it?

       

      There is also no (significant) loss in TPC efficiency at high eta, independent of the threshold:

      -> Conclusion: This must be an effect of changed track parameters.

       

    • 10:25 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      Changes after the mismatch fix

      • Tuned PbPb 50 KHz after fixing the HIP/CUDA mismatch
      • Lowered grid sizes to maximum of 600 (kept max 900 only for some small kernels)
      • Lowered number of trials
      • New constraint in the optimization function:
        if min_block_per_cu * max_threads / 256 > 10:
           return float("inf")

        Penalize configurations that would make the second parameter > 10
      • In this way, avoid kernels with no register spillage (since the compiler ignores min_active_warps > 10) but with high grid size
      • Forces the algorithm to focus on small but good configurations
      • Best parameters perf [ms] Before mismatch fix After mismatch fix + constraints
        Clusterizer step 732.18 757.02
        CompressionKernels_step1unattached 523.69 524.31
        GMMergerCollect 93.86 94.40
        GMMergerSectorRefit 284.58 267.72
        GMMergerTrackFit 761.78 704.26
        FollowLoopers + CompressionKernels_step0attached 508.43 480.44
        TrackletConstructor step 1091.24 > 4000

       

      For the tracklet constructor step, since it is highly dimensional, it was more probable to fall into the condition min_block_per_cu * max_threads / 256 > 10, leading to many inf values and few acceptable values. Removing the constraint lead to an optimal solution of 1128.05.

      Version Mean sync time [ms]
      Default - before mismatch fix 4679.83 ms ± 3.64 ms
      Default - after mismatch fix 4682.61 ms ± 5.77 ms
      Tuned - before mismatch fix 4317.97 ms ± 4.17 ms (7,73%)
      Tuned - after mismatch fix + extra constraints 4315.11 ms ± 3.06 ms (7,79%)


      Reduced grid sizes in best parameters. Some examples:

      Kernel Best conf now Best conf before
      GMMergerCollect 1024x120 768x600
      GMMergerSectorRefit 64x480 448x840
      GMMergerTrackFit 64x360

      64x900

      CompressionKernels_step0attached 64x180

      192x60

      GMMergerFollowLoopers 128x240

      64x900

      NVIDIA GPUs tuning

      Got access to the NGT cluster. Started to adapt the tuner for Nvidia GPUs.

      ToDos:

      • Investigate why big grid sizes with low occupancy are preferred rather than the small grid sizes with same occupancy
      • Decide if how to integrate new default parameters in O2

      ALICE contribution to HS23

    • 10:30 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)
    • 10:35 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))

      Working with the TODOs:- 

      A Compiler directive (GCC diagnostic need to be removed in GPU/GPUTracking/Base/hip/GPUReconstructionHIPIncludesSystem.h class,

      but checked that the entire class not needed. If I remove the class then also, no make error. 

      In this class there are few includes only from HIP and Thrust. So we may remove it fully. 

       

       

    • 10:40 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))

      OpenCL

      crash when compiling ZS decoder

      • Submitted reproducer
      • Last update from PoCL devs: Not a trivial fix, requires rewriting one step in compiler pipeline

      Memory corruption in ZS decoder

      • Looks like miscompilation after all
      • Issue disappears when running single-thread decoder via OpenCL
      • Kernel consists of two parts
        • Decode header region in page into shared memory
        • Decode adc data from page in global memory 
      • Replacing header decoding with single-threaded variant: Issue almost disappears, but some wrong values still written
      • Dumping written values with printf:
        • OpenCL version writes correct values except for memory corruption
        • So compared to CPU version: expect identical output + bunch of additional writes
        • Actual output: OpenCL writes less values than CPU + almost no overlap in output???
        • -> printf affected by miscompilation, so useless here?

       

      New GPU server

      Final configuration:

      • Dev machine with 64 core threadripper + RTX 5080 (& MI50)
      • New 2U server for alibi, repurpose old alibi to CI GPU machine

       

      Order with Guy. Waiting for update.

    • 10:45 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Gabriele

      • Merged ITS launch bounds into O2
      • Studied ITS GPU code, next week meeting with Matteo for clarifications + coordination with Felix when he's back

       

    • 10:50 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)