Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:00 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias; need to check if it works.
      • Test with GPU GRID jobs at NERSC pending.
      • Will tune the existing 16-core settings, and add a SITEARCH for 16-core CPU and for 16-core CPU + generic NVIDIA / AMD GPU, as for 8 cores.
      • Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Miscompilation / internal compiler error fixed in new clang for ROCm 7.x; SDMA engine synchronization bug still not fixed.
        • Serialization bug pending.
        • Miscompilation on MI100 leading to a memory error pending.
        • New miscompilation on MI50 with ROCm 7.0 when RTC is disabled.
        • New miscompilation on MI50 with ROCm 6.3 and 7.0 when RTC is enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
        • Waiting for ROCm 7.2, which could fix the MI100 serialization issue for good. Not clear yet with regard to the miscompilation problems.
      • ROCm 7.2 released, waiting for EPN to set it up with new dev2 server.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Sergey is checking.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
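      A minimal sketch of the alignas() approach mentioned above. The struct layout here is a hypothetical stand-in, not the real MC label type from O2:

      ```cpp
      #include <cstdint>

      // Hypothetical stand-in for the Monte Carlo label type; the real
      // layout lives in O2. Without an explicit alignment, this 8-byte
      // struct only requires 4-byte alignment, so arrays of labels may
      // straddle 8-byte boundaries.
      struct MCLabel {
        int32_t trackID;
        int16_t eventID;
        int16_t sourceID;
      };

      // alignas(8) raises the alignment so that 64-bit loads of a whole
      // label are safe on both host and device.
      struct alignas(8) AlignedMCLabel {
        int32_t trackID;
        int16_t eventID;
        int16_t sourceID;
      };

      static_assert(alignof(MCLabel) == 4, "plain struct is 4-byte aligned");
      static_assert(alignof(AlignedMCLabel) == 8, "alignas raises alignment");
      ```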

       

      Other topics:

      • Molex connectors for GPU CI server arrived, still waiting for cables.
      • Removed bogus setting of the ROCm path in LD_LIBRARY_PATH by the ONNXRuntime recipe. Seems to have no side effects.

       

      EPN GPU Topics:

      • Status of power cable problem for dev2 server and MI210?

       

    • 10:20 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
      • Fixed problem of cuda and rocm library detection: https://github.com/alisw/alidist/pull/6093
      • Highlighted a problem with GPU architecture recognition (PR by David)
      • Highlighted a problem with the aliBuild virtual py-env not being used for building (PR by Giulio)

       

      Currently working on dumping the cluster errors in the right place (work ongoing again now, since I was on vacation for a few days).

    • 10:25 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      Benchmarks table

      About L40S

      Note: L40S uses the AMPERE default parameters

      For real PbPb data:

      Sync mean time default: 2018.66 ms ± 167.23 ms

      Sync mean time optimised: 1300.35 ms ± 92.61 ms

      Performance gain with tuned params: 35.58%

      For sim PbPb data:

      Sync mean time default: 1677.37 ms ± 2.52 ms

      Sync mean time optimised: 1031.51 ms ± 2.85 ms

      Performance gain with tuned params: 38.50%
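      For reference, the quoted gains follow from gain = (default − tuned) / default; a quick sanity check:

      ```cpp
      // Relative speedup in percent: the fraction of the default
      // processing time saved by the tuned parameters.
      double gainPercent(double defaultMs, double tunedMs) {
        return 100.0 * (defaultMs - tunedMs) / defaultMs;
      }
      // gainPercent(2018.66, 1300.35) ≈ 35.58 (real PbPb)
      // gainPercent(1677.37, 1031.51) ≈ 38.50 (sim PbPb)
      ```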

      New GPU parametrization

      • PR is online [link]
      • I will update with the logic of the new GPU detection once it is merged
      • Is it possible to suppress clang for that JSON?
      • MI100 bug?
    • 10:30 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      NextGenTrigger Task 1.7

      • Implementing our SoA code in MadGraph, together with
        • Stefan Roiser
        • Daniele Massaro
      • No news

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended the CI pipeline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
      • Maybe go back from CRTP to classical inheritance
      • Better implementation of iterators to support
        • std::sort
        • thrust::sort
        • our custom sort function
      • Next Steps:
        • Make better use of SoA to improve performance
        • Try David's suggestion
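      A minimal sketch of the AoS-to-SoA transformation underlying this work (the names are illustrative, not the actual O2 or NGT classes):

      ```cpp
      #include <cstddef>
      #include <vector>

      // Array-of-structs: one track's fields are contiguous, which
      // scatters memory accesses when many GPU threads each read the
      // same field of consecutive tracks.
      struct TrackAoS {
        float x, y, z;
      };

      // Struct-of-arrays: each field is stored contiguously, so thread i
      // reading x[i] produces coalesced loads.
      struct TracksSoA {
        std::vector<float> x, y, z;

        std::size_t size() const { return x.size(); }

        void push_back(const TrackAoS& t) {
          x.push_back(t.x);
          y.push_back(t.y);
          z.push_back(t.z);
        }
      };
      ```

      Supporting std::sort / thrust::sort on such a container requires proxy-reference iterators, which is what the iterator work above is about.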
    • 10:35 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)

      OpenCL

      No news.

      GPU Servers

      Dev machine: no news.

      Highly Ionizing Particles

      Quick followup to last week:

      Finished vectorizing kernel (GPU threads load 4-8 charges at once).

      Performance barely changes...

      Also, the old kernel was assuming the wrong memory layout (pad-major instead of time-major).

       

      The pad filter kernel parallelises across pads, so switching to the time-major layout and vectorizing the memory access also reduces the number of threads by the same amount...

       

      Fixed parallelization across rows:

      • Fixed memory access to time-major layout
      • Fixed bank conflicts when accessing shared memory

       

      This increases throughput from 80 GB/s to 128 GB/s, basically identical performance to the current version (130 GB/s).

      TODO: Needs testing on Nvidia.

      Pad filter and cluster finder currently assume 64-byte cache lines; Nvidia moved to 128-byte cache lines a couple of generations ago.
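      To illustrate the layout change discussed above (sizes and names are illustrative, not the real TPC ones): with a time-major layout, threads that sweep over pads at a fixed time bin touch contiguous addresses, which is what lets the vectorized accesses coalesce.

      ```cpp
      #include <cstddef>

      // Illustrative charge-map dimensions.
      constexpr std::size_t kNPads = 152;
      constexpr std::size_t kNTimeBins = 1000;

      // Pad-major: all time bins of one pad are contiguous.
      constexpr std::size_t idxPadMajor(std::size_t pad, std::size_t t) {
        return pad * kNTimeBins + t;
      }

      // Time-major: all pads of one time bin are contiguous, so
      // neighbouring pads at the same time bin share a cache line.
      constexpr std::size_t idxTimeMajor(std::size_t pad, std::size_t t) {
        return t * kNPads + pad;
      }

      static_assert(idxTimeMajor(1, 0) - idxTimeMajor(0, 0) == 1,
                    "adjacent pads are adjacent in memory");
      static_assert(idxPadMajor(1, 0) - idxPadMajor(0, 0) == kNTimeBins,
                    "adjacent pads are a full time column apart");
      ```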

    • 10:45 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Felix:

      Found a TF which cannot be processed on the GPU but survives the CPU code (peak RSS ~15 GB).

      Implemented more memory clearing in https://github.com/AliceO2Group/AliceO2/pull/14999; now the same TF survives on the GPU (peak allocation ~18 GB).

      Note, though, that this comes at the price of more device<->host<->device data movement.

      This should take care of the bulk of the (smallish) difference. Remaining: verify with deterministic mode (and maybe check the impact on timing, although recovery of the TF is more important).

       

      To bring back a thing of the past: in the CI and also in local compilations on EPNs, I always observe warnings of this kind:

      ```

      In file included from /opt/rocm-6.3.2/lib/llvm/lib/clang/18/include/cuda_wrappers/bits/basic_string.h:7:
      In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/basic_string.h:40:
      In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/ext/alloc_traits.h:34:
      /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/alloc_traits.h:272:8: warning: 'destroy<o2::its::TrackITSExt>' is deprecated: use 'allocator_traits::destroy' instead [-Wdeprecated-declarations]
        272 |         { __a.destroy(__p); }
            |               ^
      /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/alloc_traits.h:378:4: note: in instantiation of function template specialization 'std::allocator_traits<std::pmr::polymorphic_allocator<o2::its::TrackITSExt>>::_S_destroy<std::pmr::polymorphic_allocator<o2::its::TrackITSExt>, o2::its::TrackITSExt>' requested here
      

      ```

      Notice the c++/11/... in the paths: these are the libstdc++ headers shipped with the system GCC 11, not the ones from our GCC toolchain.

      This is of course just a deprecation warning, so not actually critical (for now...).

      But David and I had a chat about this a while back and had the following problem:

      Almost pseudo CMake code:

      ```

      set(CMAKE_HIP_HOST_COMPILER "$ENV{GCC_TOOLCHAIN_ROOT}/bin/gcc")

      message(STATUS "SETTING CMAKE_HIP_HOST_COMPILER to ${CMAKE_HIP_HOST_COMPILER}") 

      enable_language(HIP)

      message(STATUS "HIP language enabled: ${CMAKE_HIP_COMPILER}")

      message(STATUS "COMPILER to ${CMAKE_HIP_HOST_COMPILER}")

      ```

      Output

      ```

      -- SETTING CMAKE_HIP_HOST_COMPILER to /home/fschlepp/alice/sw/slc9_x86-64/GCC-Toolchain/v14.2.0-alice2-2/bin/gcc

      -- HIP language enabled: /opt/rocm/llvm/bin/clang++

      -- COMPILER to

      ```

      Somehow, after enable_language, the host compiler is reset. To what? Probably the system compiler.

      It would be nice to repair this somehow.
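      One possible (untested) direction: CMake's compiler-selection variables are only reliably honoured when they are cache entries (or are set before the first project()/enable_language() call); a plain set() can be clobbered by the compiler-detection step. Something along these lines might survive enable_language:

      ```cmake
      # Sketch, not verified: store the host compiler as a cache entry so
      # the HIP compiler-detection step does not reset it.
      set(CMAKE_HIP_HOST_COMPILER "$ENV{GCC_TOOLCHAIN_ROOT}/bin/gcc"
          CACHE FILEPATH "Host compiler used for HIP" FORCE)

      enable_language(HIP)
      message(STATUS "HIP host compiler: ${CMAKE_HIP_HOST_COMPILER}")
      ```

      Alternatively, passing -DCMAKE_HIP_HOST_COMPILER=... on the cmake command line creates the cache entry before any detection runs.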

    • 10:50 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)