Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:00 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
      • Will tune existing 16-core settings, add a SITEARCH for 16core CPU, and 16coreCPU + generic NVIDIA / AMD GPU, like for 8 core.
      • Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Need to check ROCm 7.2 correctness.
      • Need to understand deterministic mode issue on AMD Pro 9700 reported by Oliver - Status?

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
      • Check for unnecessary f64 instructions in GPU code.
      • PR with changes for qTot-reading / dEdx code, to support >16 bit range for saturated signals.

       

      Other topics:

      • Need to bump ONNXRuntime to 1.24, Giulio is checking, needed for ROCm 7.2 - Status?

       

      EPN GPU Topics:

      • Test of MI50-EPN equipped with 1 RTX Pro 6000 quite successful.
        • GPU temperature saturates at 78°C under full load.
        • At high temperature, seeing up to 2% degradation in GPU clocks and GPU performance.
        • 1 RTX Pro 6000 can run at full speed in FST shared by both CPU sockets / NUMA domains.
        • Right now, EPN with 1 RTX Pro 6000 Max-Q (300W TDP) reaches 5/6th of the performance of 1 8-MI50-EPN.
          • (From CPU side, would need ~280 64-core EPNs, but then GPU performance should be slightly above 8*MI50).
          • Perhaps 1 RTX Pro Server GPU (600W) might already be enough, 2 * RTX Pro 5000 might be another option. Almost certainly, next gaming GPU generation will be fast enough.

       

    • 10:20 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Cluster finder

      • Successful Pb-Pb runs last Friday 🥳
      • Some preliminary results from cpass0:

       

      • 16% CTF size reduction

      NN improves separation power by ~10%!

       

      • DCA distributions maintained
      • Shared cluster map very similar
      • NCl/track shows the expected reduction: Approx. 2-3 clusters for long tracks

       

      GPU benchmarking

      • PR: https://github.com/AliceO2Group/AliceO2/pull/15514
      • Benchmark single GPU performance across servers (testing on epn000 and dev00 server)

       

      • Multithreading CPU processes for ITS reduces downtime between gpu-reco processes enough to make a measurement
      • Top: Default reco, Bottom: NN full reco. One TF of LHC24ar. TFLOOP=100
      • Allowed downtime between gpu processing steps: 50ms

       

      All detectors, default settings

      • epn000

      • pdp-dev00 (RTX5080), single GPU:
        • [1] Downtimes too large for default reco (80-130 ms even when setting CPU process threads to 24) -> best guesstimate
          • Wall-time mean including allowed gaps: 0.390235 s
            Individual duration sample mean: 0.364311 s
            Individual duration sample sigma: 0.00405494 s
        • [2]
          • Wall-time mean including allowed gaps: 0.607306 s
            Individual duration sample mean: 0.587373 s
            Individual duration sample sigma: 0.0101604 s

       

      Summary:

      • Processing time increase using MI50 NN / default reco: ~1.5
      • Processing time increase using RTX5080 NN / default reco (to be taken with a grain of salt): ~1.5

      -> No improvement (so far) for NN using RTX5080 over MI50 (concerning hardware acceleration)
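      As a cross-check of the ~1.5 factor on the RTX5080, the ratio can be recomputed from the pdp-dev00 numbers listed above. This is only a sketch: it assumes that entry [1] is the default reco and entry [2] is the NN full reco, which the notes do not state explicitly for [2]. The variable names are chosen for illustration.

```python
# Timings copied from the pdp-dev00 (RTX5080) entries above, in seconds.
# Assumption: [1] = default reco, [2] = NN full reco.
default_wall = 0.390235   # [1] wall-time mean including allowed gaps
default_mean = 0.364311   # [1] individual duration sample mean
nn_wall = 0.607306        # [2] wall-time mean including allowed gaps
nn_mean = 0.587373        # [2] individual duration sample mean

# Processing time increase NN / default reco, for both estimators.
ratio_wall = nn_wall / default_wall
ratio_mean = nn_mean / default_mean

print(f"wall-time ratio:       {ratio_wall:.2f}")
print(f"sample-mean ratio:     {ratio_mean:.2f}")
```

      Both estimators land between 1.5 and 1.7, consistent with the rounded ~1.5 quoted in the summary, keeping in mind that the default reco number is itself a guesstimate.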

       

      Measurement with "WORKFLOW_DETECTORS": "TPC,CTP"

      • epn000

      • pdp-dev00 (RTX5080):

      Summary:

      • Seems much more precise and with fewer losses than using all detectors. Can load GPU much better (far fewer TFs lost due to exclusions...)
    • 10:25 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      PR with opt params

      • PR is online
      • Added two architectures to FindO2GPU.cmake
        • Hopper
        • MI300 (GCNA 4.0)
      • Changed Ampere detection from cc 8.6 to 8.0 (why was it like this?)
        if(CUDA_FIRST_TARGET GREATER_EQUAL 120)
          set(CUDA_TARGET BLACKWELL)
        elseif(CUDA_FIRST_TARGET GREATER_EQUAL 90)
          set(CUDA_TARGET HOPPER)
        elseif(CUDA_FIRST_TARGET GREATER_EQUAL 89)
          set(CUDA_TARGET ADA)
        elseif(CUDA_FIRST_TARGET GREATER_EQUAL 80) # was 86 before
          set(CUDA_TARGET AMPERE)
        elseif(CUDA_FIRST_TARGET GREATER_EQUAL 75)
          set(CUDA_TARGET TURING)
        elseif(CUDA_FIRST_TARGET GREATER_EQUAL 60)
          set(CUDA_TARGET PASCAL)
        elseif(CUDA_FIRST_TARGET GREATER_EQUAL 30)
          set(CUDA_TARGET KEPLER)
        elseif(CUDA_FIRST_TARGET GREATER_EQUAL 20)
          set(CUDA_TARGET FERMI)
        else()
          set(CUDA_TARGET TESLA)
        endif()
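
      To illustrate the effect of the Ampere threshold change, here is a minimal Python sketch that mirrors the if/elseif chain above. The function name `cuda_target` and the `ampere_min` parameter are made up for this illustration and are not part of FindO2GPU.cmake.

```python
# Thresholds copied from the CMake chain above, highest first.
THRESHOLDS = [
    (120, "BLACKWELL"),
    (90, "HOPPER"),
    (89, "ADA"),
    (80, "AMPERE"),
    (75, "TURING"),
    (60, "PASCAL"),
    (30, "KEPLER"),
    (20, "FERMI"),
]


def cuda_target(first_target, ampere_min=80):
    """Return the architecture name for a compute capability such as 86.

    ampere_min is a hypothetical knob that lets us replay the old
    behaviour (86) next to the new one (80).
    """
    for minimum, name in THRESHOLDS:
        if name == "AMPERE":
            minimum = ampere_min
        if first_target >= minimum:
            return name
    return "TESLA"  # same fallback as the else() branch


for cc in (80, 86, 89, 90, 120):
    print(cc, "new:", cuda_target(cc), "old:", cuda_target(cc, ampere_min=86))
```

      With the old threshold of 86, a compute capability 8.0 device (for example an A100) falls through to TURING, while with the new threshold it is detected as AMPERE.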

      Best parameters comparison - Nvidia

      Kernels in single stream steps

      Kernels in multi stream steps

      Next steps

      • Validate against manual tuning
    • 10:30 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)
       

      NextGenTrigger Task 1.7


      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended CI pipeline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
          • GPUTPCTrack
      • Writing ACAT Proceedings
      • Next Steps:
        • Make better use of SoA to improve performance
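
      For readers unfamiliar with the term, the following toy sketch shows the difference between an array-of-structures (AoS) and a structure-of-arrays (SoA) layout. It is not the NGT SoA code and not the layout of the classes listed above: the `TrackParam` record, its fields and the `to_soa` helper are hypothetical and only serve as an illustration.

```python
from array import array
from dataclasses import dataclass


@dataclass
class TrackParam:
    # Hypothetical fields, chosen only for this illustration.
    x: float
    y: float
    z: float


FIELDS = ("x", "y", "z")


def to_soa(tracks):
    """Convert a list of records (AoS) into one contiguous array per field (SoA)."""
    return {name: array("d", (getattr(t, name) for t in tracks)) for name in FIELDS}


# AoS: one object per track, fields interleaved in memory.
aos = [TrackParam(1.0, 2.0, 3.0), TrackParam(4.0, 5.0, 6.0)]

# SoA: one contiguous array per field, so a loop over all z values
# touches neighbouring memory locations.
soa = to_soa(aos)
print(list(soa["z"]))
```

      On a GPU, the SoA layout lets neighbouring threads read neighbouring addresses when they access the same field of consecutive tracks, which is the usual motivation for this kind of change.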
    • 10:40 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)

      OpenCL

      No news.

      GPU Servers

      CI Server: No update since Sergio left.

      Highly Ionizing Particles

      • MC label forwarding for saturated clusters done
      • This uncovered a bug where the last digit of a sector had wrong labels (fixed now)
      • Working on CPU version of tail filter now
    • 10:45 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Gabriele

      Tested a new vertex-finding algorithm on CPU, similar to the current implementation but parallelized:

      1. For each track, find the z component of the DCA to the beamline
      2. Sort tracks by z value
      3. For each track, count how many tracks fall into a window centered on its z (and are time compatible) --> density
      4. For each track, check if it is a local maximum (look at neighbours)
      5. For each local maximum, fit the tracks that fall into its window
      6. For each seed, check that it is not a duplicate of a neighboring seed
      7. Surviving seeds are promoted to vertices
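
      The steps above can be sketched in a few lines of Python. This is a simplified 1D toy and not the ITS code: it starts from precomputed z values (step 1), ignores time compatibility, replaces the fit in step 5 by a plain mean, and the names `find_vertex_seeds`, `half_window` and `min_density` are invented for this illustration.

```python
from bisect import bisect_left, bisect_right


def find_vertex_seeds(z_values, half_window=0.1, min_density=3):
    """Toy 1D seeding: density peaks in z become vertex candidates."""
    z = sorted(z_values)  # step 2: sort tracks by z

    # Step 3: for each track, index range of tracks within +-half_window.
    # Every entry is independent of the others, so this loop is parallel.
    lo = [bisect_left(z, zi - half_window) for zi in z]
    hi = [bisect_right(z, zi + half_window) for zi in z]
    density = [h - l for l, h in zip(lo, hi)]

    seeds = []
    for i in range(len(z)):
        if density[i] < min_density:
            continue
        # Step 4: keep only local maxima of the density inside the window.
        if any(density[j] > density[i] for j in range(lo[i], hi[i])):
            continue
        # Step 5: stand-in for the fit, the mean z of the window members.
        members = z[lo[i]:hi[i]]
        seed_z = sum(members) / len(members)
        # Step 6: drop duplicates of the previously accepted seed.
        if seeds and abs(seed_z - seeds[-1]) < half_window:
            continue
        seeds.append(seed_z)  # step 7: surviving seeds
    return seeds


# Two clusters of tracks plus one isolated track.
tracks_z = [0.00, 0.01, 0.02, 0.03, 5.00, 5.01, 5.02, 9.00]
seeds = find_vertex_seeds(tracks_z)
print(seeds)
```

      Steps 3 to 5 only read the sorted array, which is what makes the approach easy to spread over threads. Only the duplicate removal in step 6 needs information about neighbouring seeds.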

       

      Current version:

      [1727121:its-tracker]: [17:12:06][INFO]  - Vertex finding: found 132 vertices (total 132) in 50.63 ms

      Parallel version (20 threads):

      [1725878:its-tracker]: [17:05:17][INFO]  - Vertex finding: found 132 vertices (total 132) in 14.35 ms
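
      Taking the two log lines above at face value (one measurement each), the speedup and the parallel efficiency for 20 threads can be worked out as follows. The variable names are chosen for illustration.

```python
# Vertex-finding times from the two log lines above, in milliseconds.
t_current = 50.63    # current version
t_parallel = 14.35   # parallel version
n_threads = 20

speedup = t_current / t_parallel
efficiency = speedup / n_threads  # fraction of ideal linear scaling

print(f"speedup:    {speedup:.2f}x")
print(f"efficiency: {efficiency:.1%}")
```

      The speedup is about 3.5x, which corresponds to a parallel efficiency of under 20% at 20 threads, so there may still be room for tuning.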

      Preliminary results from 4 simulated Pb-Pb 50kHz TFs:

       

      Next steps

      • Validate with more TFs
      • Optionally tune parameters
      • Implement the GPU version

       

    • 10:50 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)