Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:00 - 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • Will tune the existing 16-core settings and add a SITEARCH for 16-core CPU and for 16-core CPU + generic NVIDIA / AMD GPU, as for 8-core.
      • Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MIGraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem of __device__ inline functions leaking symbols into the host code.
      • Need to check ROCm 7.2 correctness.
      • Need to understand deterministic mode issue on AMD Pro 9700 reported by Oliver - Status?

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
      • Check for unnecessary f64 instructions in GPU code.
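      As an illustration of the last item, a minimal sketch of how f64 instructions can slip into single-precision code through an unsuffixed literal. The two helper functions are hypothetical and not taken from O2; they only show the pattern one would look for.

```cpp
#include <type_traits>

// Hypothetical helper (not from O2): halves a single-precision value.
inline float scaleWithDoubleLiteral(float x)
{
  // 0.5 is a double literal: x is promoted, the multiplication is done in
  // f64, and the result is narrowed back to float on return.
  return x * 0.5;
}

// Same computation, kept entirely in single precision.
inline float scaleWithFloatLiteral(float x)
{
  // The f suffix keeps the whole expression in f32.
  return x * 0.5f;
}

// The expression types make the implicit promotion visible at compile time.
static_assert(std::is_same_v<decltype(1.0f * 0.5), double>);
static_assert(std::is_same_v<decltype(1.0f * 0.5f), float>);
```

      Both functions return the same value for inputs like this one; the difference is only in the instructions a compiler may emit, which is what matters on GPUs with low f64 throughput.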

       

      Other topics:

      • Need to bump ONNXRuntime to 1.24, Giulio is checking, needed for ROCm 7.2 - Status?
      • Test at NERSC successful, first results presented at CHEP and LHCP.
        • GPU TPC + ITS tracking working nicely on A100 GPU, no backpressure from GPU, but workflow is CPU-bound, as on EPNs.
        • For some reason, the GPUs at NERSC perform much slower than on EPNs (even though both are 64 cores); this should be investigated.

       

      EPN GPU Topics:

      • Test EPN with RTX 6000 available, started testing yesterday evening, but no results yet.

       

    • 10:20 - 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Cluster finder

      • Successful Pb-Pb runs last Friday 🥳
      • Some preliminary results from cpass0:

       

      • 16% CTF size reduction

      NN improves separation power by ~10%!

       

      • DCA distributions maintained
      • Shared cluster map very similar
      • NCl/track shows the expected reduction: approx. 2-3 clusters for long tracks
    • 10:25 - 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)
      • Fully implemented the GPU-limits-aware version
      • Block size: from warp size to max block size
      • Blocks per SM: computed as sampled fraction of SM * max_blocks(block_size_sample)
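      A minimal sketch of how such limits-aware sampling could look, not the actual tuner. The GpuLimits fields, the maxBlocks model (bounded only by resident threads and the hardware block limit) and the reading of "sampled fraction" as a per-SM fraction of that bound are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <random>

// Hypothetical device limits; real values would be queried from the GPU runtime.
struct GpuLimits {
  int warpSize;        // threads per warp / wavefront
  int maxBlockSize;    // maximum threads per block
  int maxThreadsPerSM; // maximum resident threads per SM
  int maxBlocksPerSM;  // maximum resident blocks per SM
};

struct LaunchSample {
  int blockSize;
  int blocksPerSM;
};

// Assumed model for max_blocks(block_size): resident blocks per SM are bounded
// by the thread limit and by the hardware block limit, and are at least 1.
inline int maxBlocks(const GpuLimits& lim, int blockSize)
{
  return std::max(1, std::min(lim.maxBlocksPerSM, lim.maxThreadsPerSM / blockSize));
}

// Draw one candidate configuration that respects the limits:
// block size is a multiple of the warp size in [warpSize, maxBlockSize],
// blocks per SM is a sampled fraction of maxBlocks(blockSize).
template <class Rng>
LaunchSample sampleLaunch(const GpuLimits& lim, Rng& rng)
{
  std::uniform_int_distribution<int> warps(1, lim.maxBlockSize / lim.warpSize);
  std::uniform_real_distribution<double> fraction(0.0, 1.0);
  const int blockSize = warps(rng) * lim.warpSize;
  const int bound = maxBlocks(lim, blockSize);
  const int blocksPerSM = std::clamp(static_cast<int>(fraction(rng) * bound) + 1, 1, bound);
  return {blockSize, blocksPerSM};
}
```

      With illustrative limits such as GpuLimits{32, 1024, 2048, 32}, every sample stays inside the valid range, so no evaluation is spent on configurations the device cannot launch.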

      Next steps

      • Study interplay of parameters across GPUs

      ALICE GPU workload for HEPScore23

      • Bumped O2 version in the workload, with newest tpctransform.dump
      • Will meet with Robin this afternoon

       

    • 10:30 - 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)
    • 10:35 - 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 - 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)

      OpenCL

      No news.

      GPU Servers

      CI Server: No update since Sergio left.

      Highly Ionizing Particles

      QC plots from TPC

       

      Next steps

      • MC forwarding for saturated clusters (mostly done, but still need to test)
      • CPU version of tail filter
    • 10:45 - 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Gabriele: no news; started working again on the GPU seeding vertexer this week

      Felix: no news

    • 10:50 - 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)

      physics PbPb runs with NN clusterizer successfully taken on Friday

      • full NN: 572485
      • CF regression: 572486

       

      HIP tail filter

      • two short runs taken with filter enabled: 572487, 572488
      • TPC QC experts reported tracking performance to be identical with other runs
      • Jens wants to do a few more checks on the Qtot distributions, where he sees some differences
      • once confirmed by Jens, we should enable the HIP tail filter for the remaining physics runs

       

      OS update on EPNs

      • want to update to AlmaLinux 10 asap during LS3
      • need to clarify compatibility of our code with a ROCm 7 release
      • summary by David during S&C days October 2025 
        • miscompilation on MI100
        • miscompilation with RTC+dEdx enabled
        • another miscompilation with RTC disabled
        • 1.5% performance penalty