Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 10:00 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias - need to check whether it works.
      • Test with GPU GRID jobs at NERSC pending.
      • Will tune the existing 16-core settings, and add a SITEARCH for a 16-core CPU and for a 16-core CPU + generic NVIDIA / AMD GPU, as already done for 8 cores.
      • Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
      • Updated GPU buffer sizes for 0.2 T low-field TPC processing; will move the parameters to a configurableParam so we can change them without rebuilding.
      • Problem in one run where online CCDB updates were not working and the TPC track model encoding used incorrect field settings. Will implement a workaround for decoding: use the stored (incorrect) field for decoding, but the correct field for tracking.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Tested ROCm 7.2 on MI50 / MI100 / MI210. Running stably on MI50 / MI210, not yet checked for correctness. Crashes randomly on MI100, but with what seems to be a different pattern than the serialization bug we had before.
      • Need to understand and fix crash on RTX Pro 6000.
      • New GPU Builder Container with CUDA 13.1.1 available, supporting GCC 15.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Final solution: merge the transformation maps on the fly into a single flat object:
        • Maps now yield correct results, but there is a 1.5x performance regression when running on GPUs, which must be investigated.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542

       

      Other topics:

      • Major improvement of GPU CMake and parameter loading:
        • Instead of detecting individual architectures, now performing a >= comparison on compute capabilities; if no tuning for the exact architecture is available, take the parameters of the closest previous architecture for which we have tuned values.
        • Support building O2 with database of tuned parameters provided in CSV format.
        • Support to merge on-the-fly multiple database files in CSV and JSON format.
        • Can create binary .par files for loading at RTC from CSV and JSON file with simple script.
        • Sped up GPU CMake from ~2sec to ~0sec.
      • Switched from hard-exporting ALIBUILD_O2_FORCE_GPU=1 in the GPU builder container to setting it in the env files of CI jobs and in the defaults of the Jenkins builders where a GPU is needed; it can thus e.g. be disabled for dataflow-default CI jobs / Jenkins builders.
        • Everyone using the slc9-gpu-builder container, or Jenkins, to build with GPU: please note that ALIBUILD_O2_FORCE_GPU=1 must be exported to get the old behavior.
        • Note that by default GPUs are autodetected, so all backends remain available, but if no GPU is detected the build targets the fallback architectures (MI50 for AMD; sm_75-virtual for NVIDIA), not our default list of production architectures (including MI100, RTX Pro 6000, ...).
      • Need to bump ONNXRuntime to 1.24 (needed for ROCm 7.2); Giulio is checking.
      • Opened PRs to bump CMake to 4.2 (https://github.com/alisw/alidist/pull/6135) and boost to 1.90 (https://github.com/alisw/alidist/pull/6134). Giulio will take care of bumping GCC to 15.2; could also think about bumping arrow and clang.

       

      EPN GPU Topics:

       

    • 10:20 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
    • 10:25 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      • Tested a Radeon VII on the dev machine
      • Used the current default parameters of the MI50, which are optimal for the lhc24ar_raw dataset
      • Noticeably slower than the MI50 32 GB for the first three events
      • The reported times have been taken with the following command:
        ROCR_VISIBLE_DEVICES=0 numactl --membind 0 --cpunodebind 0 ./ca -e lhc24ar_raw --gpuType HIP --memSize 15000000000 --inputMemory 6000000000 --outputMemory 10000000000
        --sync -s [event_index] -n [event_index + 1] --runs 10 --PROCdoublePipeline --RTCenable --RTCcacheOutput

      Graphics Card  | Runtime    | OS             | Event 1 [s] | Event 2 [s] | Event 3 [s]
      AMD MI50 32 GB | ROCm 6.3.2 | AlmaLinux 9.5  | 4.86        | 5.57        | 5.63
      AMD Radeon VII | ROCm 7.1.1 | AlmaLinux 10.1 | 6.35        | 7.32        | 7.41

       

      • Bug present in current O2 when an RDNA GPU, or a GPU equal to or newer than an MI210, tries to compile: since specific parameters are not present yet, static_assert(GPUCA_PAR_AMD_EUS_PER_CU > 0); fails because the parameter is defaulted to 0. I will tune the MI210 and open a PR with the parameters for both archs, which will fix this. The quickest temporary fix is to fall back to MI100 parameters when these archs are detected.
    • 10:30 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      NextGenTrigger Task 1.7

      • Implementing our SoA code in MadGraph, together with
        • Stefan Roiser
        • Daniele Massaro
      • No news

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended the CI pipeline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
          • GPUTPCTrack
      • Went back from CRTP to classical inheritance
      • Implemented new way of providing iterators in O2: https://godbolt.org/z/haas6YP6c
      • Issues since the latest updates of O2:
        • CI pipelines on the NGT cluster fail (problem of missing libraries)
        • GPUs that are not inherently supported by the new O2 fail (thanks Gabriele for the workaround)
        • Missing build dependencies on CVMFS (Giulio is on it)
      • Next Steps:
        • Adapt unit test to the changes
        • Adapt benchmarks to the changes
        • Make better use of SoA to improve performance
        • Try David's suggestion

       

      Issues with AMD W7900 after updating to latest O2

      1. Providing a .par file with certain parameters results in non-deterministic behavior, despite deterministic mode.
      2. Can't create my own .par files with dumpGPUDefParam.C of the newest O2; it yields an "incompatible launch bounds" error.
      3. Changing FindO2GPU.cmake so that the VEGA or TAHITI settings are applied results in a compilation error: static_assert(NTHREADS == GPUCA_WARP_SIZE)
      4. The only parameters that work on the W7900 are the MI100 ones with WARP_SIZE changed to 32.
      5. Other parameters (e.g. RDNA) yield deterministic behavior, but with a (slightly) different GPU.out than obtained with other GPUs.

      Moreover:

      1. In my setup, I couldn't just add new GPUs to GPUParameters.csv: the headers for the newly added GPUs were created, but not the corresponding .par files.
      2. In GPUParameters.csv, the AMD GPUs RDNA and MI210 are missing, but they show up in FindO2GPU.cmake. Is this intended?
    • 10:35 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)

      OpenCL

      No news.

      GPU Servers

      Documentation page: https://alice-pdp-operations.docs.cern.ch/gpu-dev/

      Highly Ionizing Particles

      Tail masking almost done. Currently working on testing implementation.

      Investigating crash on Nvidia. (happens even when not triggered)

    • 10:45 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Gabriele:

      • Restricted the computation of the vertex function to the beam line V(0,0,z)
      • Substituted the Gaussian distribution with the capped polynomial -x² + 1 (thanks Christian)
      • Reduced the number of bins in the voxelisation
      • Added extra conf params for better tracking
      • The runtime of the vertex function computation is now similar to the old vertexer!

      Some results (in red: the true vertex of that ROF)

      Next steps

      • Implement the logic to select vertex candidates, track candidates and fit the vertices
      • Tune parameters of the functions
      • Test CPU performances
      • If comparable with old vertexer, proceed with GPU implementation

       

    • 10:50 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)