Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
Useful links
Join via phone
Zoom URL
    • 10:00 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
      • Test with GPU GRID jobs at NERSC pending.
      • Will tune the existing 16-core settings, and add a SITEARCH for 16-core CPU and for 16-core CPU + generic NVIDIA / AMD GPU, as already done for 8-core.
      • Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MIGraphX support.
      • Need to find a way to build ONNXRuntime with support for both CUDA and ROCm.
      • Try to find a better solution for the problem of __device__ inline functions leaking symbols into the host code (see the sketch after this list).
      • Need to check ROCm 7.2 correctness.
      • Need to understand and fix crash on RTX Pro 6000 reported by Oliver.
      • Need to understand deterministic mode issue on AMD Pro 9700 reported by Oliver.
      • Need to understand deterministic mode issue on NVIDIA Blackwell.
      • Performance issue on Blackwell fixed (see below).
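
      A minimal sketch of the symbol-leak pattern mentioned above, assuming the usual GPUd()-style attribute macro; the names and the mitigation are illustrative, not the actual O2 code:

        // GPUd() expands to __device__ under nvcc/hipcc and to nothing in
        // plain host builds, so the same header serves both sides.
        #if defined(__CUDACC__) || defined(__HIPCC__)
        #define GPUd() __device__
        #else
        #define GPUd()
        #endif

        // Problem pattern: every host translation unit that includes this
        // header parses the body, and the host pass of the GPU compiler may
        // still emit a host-side symbol for it, which can then clash across
        // shared libraries or leak into the host ABI.
        GPUd() inline float clamp01(float x)
        {
          return x < 0.f ? 0.f : (x > 1.f ? 1.f : x);
        }

        // One possible mitigation: only provide the body in device
        // compilation passes, so the host pass never sees a definition.
        #if defined(__CUDA_ARCH__) || defined(__HIP_DEVICE_COMPILE__)
        GPUd() inline float clamp01DeviceOnly(float x)
        {
          return x < 0.f ? 0.f : (x > 1.f ? 1.f : x);
        }
        #endif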

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels (see the first sketch after this list): https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Maps now yielding correct results, but 1.5x performance regression running on GPUs.
        • PR is now green in the CI; POD version of the FastTransform and merging of the maps integrated (thanks to Ruben and Matthias, and of course Sergey!)
        • Remaining regression with the new maps vs old maps is due to reduced L2 cache hit rate (to be understood why, perhaps larger metadata?).
          • This is compensated when the maps are merged since we query only 1 map, but could still be improved.
          • Discussing with Sergey why the new version is worse with respect to cache; Sergey is interested in reproducing and improving it.
          • If we still want to change it, I would wait with merging, since otherwise we need yet another compatibility layer to load maps that were created in between.
        • On MI50, still miscompiles when RTC and dEdx are enabled.
        • New code without RTC is 14% faster than old code with RTC on MI50.
        • New maps fix the performance issues on new NVIDIA GPUs. The 4090 is roughly 3x faster (new vs. old maps, with RTC in both cases).
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
      • Check for unnecessary f64 instructions in GPU code (see the second sketch after this list).
      • Want to retune NVIDIA Blackwell now that the performance issue is fixed, and manually check what can be optimized, to get a first realistic estimate of how many GPUs we need.
      • Working on some improvements for general GPU code and the TPC POD FastTransform: get rid of defines in favor of constexpr; unify multiply-defined constants and use global GPU constants in FastTransform; move TPCFastTransformGeo to GPU constant memory.
      • Matthias is working on removing the FlatObject dependency from the FastTransform splines, which will reduce the memory footprint slightly and hopefully improve cache efficiency.
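
      A hypothetical illustration of the alignas() fix from the first item above (O2-5314); MCLabel and kLabelAlignment are made-up names, not the real O2 types:

        #include <cstddef>
        #include <cstdint>

        // Force the label type to a known alignment so that arrays of labels
        // keep a stable layout when shared between host and GPU code.
        constexpr std::size_t kLabelAlignment = 8;

        struct alignas(kLabelAlignment) MCLabel {
          int32_t trackID;
          int32_t eventID;
        };

        // Compile-time guards against the layout silently drifting.
        static_assert(alignof(MCLabel) == kLabelAlignment, "unexpected alignment");
        static_assert(sizeof(MCLabel) % kLabelAlignment == 0, "array stride broken");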
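
      A short sketch of the two cleanups mentioned above (defines -> constexpr, and avoiding accidental f64 instructions); the constant and function names are illustrative:

        // Before: a preprocessor define carries no type, and the unsuffixed
        // literal 0.5 is a double, so the multiplication below would be
        // silently promoted to f64 on the GPU.
        // #define DRIFT_SCALE 0.5

        // After: a typed constexpr float keeps the whole expression in f32.
        constexpr float kDriftScale = 0.5f;

        inline float scaleTime(float t)
        {
          return t * kDriftScale; // no hidden f64 instructions
        }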

       

      Other topics:

      • Need to bump ONNXRuntime to 1.24 (needed for ROCm 7.2); Giulio is checking - status?
      • Test at NERSC still ongoing; all jobs so far failed for non-GPU-related reasons. Jobs currently wait in the queue for up to 6 days.

       

      EPN GPU Topics:

       

    • 10:20 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Cluster error parameterization

       

      • Tested the optimization on the default cluster error mode with scaling
      • Best parameters found within bounds:
        • scaleChiY1: 1.7779086962105004
        • scaleChiY2: 0.5157460329068477
        • scaleChiY3: 1.343090130640468
        • scaleChiZ1: 1.0739495259708887
        • scaleChiZ2: 0.5274259679551516
        • scaleChiZ3: 1.1181900592137066
      • The "improvement": +0.15% efficiency, +0.2% clone-rate, -0.5% fake-rate
      • All of them constrained to [0.5, 2.0] by the optimization bounds

       

      Reminder (GPUTPCGMPropagator.h)

      That means, e.g. for the strongest changes, scaleChiY2 and scaleChiZ2: even a factor of ~2 does not change the behaviour of the tracking much. Suspected reason: a chi2 of 9 for a cluster must be an extreme outlier - such a cluster is far away and probably not attached to the track in the first place, so the cut only takes effect when the scaling factor (e.g. scaleChi*) is large. A minimal sketch of this mechanism follows below.
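
      A minimal sketch of the suspected mechanism, not the actual GPUTPCGMPropagator.h code; the cut value of 9 is taken from the discussion above, the function and parameter names are illustrative:

        // A cluster is rejected when its normalized chi2 exceeds the cut.
        // Scaling the error term by a factor of ~2 only moves clusters
        // across the cut if they were already extreme outliers, which would
        // explain why the tracking barely reacts to scaleChi* changes.
        constexpr float kChi2Cut = 9.f;

        inline bool acceptCluster(float residualY, float errY2, float scaleChiY)
        {
          const float chi2 = residualY * residualY / (errY2 * scaleChiY);
          return chi2 < kChi2Cut;
        }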

    • 10:25 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      Polishing of the tuning tool is almost ready

      • Auto-detects the GPU vendor
      • Installs all dependencies in a separate Python environment (except the profiler)
      • README with instructions in progress
      • Idea: let the user set a desired tuning duration and derive the number of iterations from it

       

      Talk on the ALICE experience with GPUs accepted at the INFN computing workshop (11-15 May)

      • Will contact Maxim for slides on GPU usage & monitoring on the GRID
    • 10:30 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)
       

      NextGenTrigger Task 1.7

      • CHEP talk together with Jolly was accepted.
      • Co-supervision of a summer student. Topic: improve the clustering algorithm CLUE.

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (NVIDIA and AMD)
        • Extended the CI pipeline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
          • GPUTPCTrack
      • Benchmarking:
        • Two independent measurements for each of the 4 NGT GPUs (standalone + profiler)
        • Integrated in the CI pipeline with a comparison to the unaltered code
        • Results are now much more consistent, at least with the profiler
      • Ran all the benchmarks with and without monitoring: results looked very consistent.
      • Next Steps:
        • Build and run O2 with a C++26 compiler to integrate reflection (needed for the CHEP benchmark)
        • Write minimal reproducer for the (weird) behavior I observed with AMD W7900 and custom .par file.
        • Make better use of SoA to improve performance (see the sketch after this list)
        • Try David's suggestion
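
      An illustrative AoS-to-SoA transformation in the spirit of the GPUTPCTrackParam work above; the field names are hypothetical, not the real O2 layout:

        #include <cstddef>
        #include <vector>

        // Array of structures: thread i reading only .y performs a strided
        // load, wasting most of each cache line.
        struct TrackParamAoS {
          float y, z, sinPhi, dzDs, qPt;
        };

        // Structure of arrays: each member is contiguous across tracks, so
        // neighboring GPU threads read consecutive elements in one
        // coalesced transaction.
        struct TrackParamSoA {
          std::vector<float> y, z, sinPhi, dzDs, qPt;

          explicit TrackParamSoA(const std::vector<TrackParamAoS>& aos)
          {
            const std::size_t n = aos.size();
            y.reserve(n); z.reserve(n); sinPhi.reserve(n);
            dzDs.reserve(n); qPt.reserve(n);
            for (const auto& t : aos) {
              y.push_back(t.y);
              z.push_back(t.z);
              sinPhi.push_back(t.sinPhi);
              dzDs.push_back(t.dzDs);
              qPt.push_back(t.qPt);
            }
          }
        };
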
    • 10:35 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)
    • 10:45 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Felix: no news

    • 10:50 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbär (CERN)