Alice Weekly Meeting: Software for Hardware Accelerators

Timezone: Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 10:00 AM – 10:20 AM
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

      • Crash after receiving corrupt TPC data: fixed. The handling of bad data was actually correct, but the multi-threaded pipeline did not handle the error correctly, and the next TF ran into an error.

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • The limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check if it works.
      • Test with GPU GRID jobs at NERSC pending.
        • Updated the default builds to include the A100 GPU architecture, and compute_75 (virtual) as the lowest compute capability for CUDA JIT compilation, so we should now support all CUDA devices from compute capability 7.5 onwards.

       

      GPU ROCm / compiler topics:

      • Issues that disappeared but are not yet understood: random server reboot with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when the DMA engine is turned off, MI100 stalling with ROCm 5.5.
      • Problem with building ONNXRuntime with MIGraphX support.
      • Need to find a way to build ONNXRuntime with support for both CUDA and ROCm.
      • Try to find a better solution for the problem of __device__ inline functions leaking symbols into the host code.
      • Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
        • Serialization bug pending.
        • Miscompilation on MI100 leading to a memory error pending.
        • New miscompilation on MI50 with ROCm 7.0 when RTC is disabled.
        • New miscompilation on MI50 with ROCm 6.3 and 7.0 when RTC is enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
        • Waiting for ROCm 7.2, which could fix the MI100 serialization issue for good. Not clear yet with regards to miscompilation problems.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Rebased the current PR, CI is green now, and it gives the same results on GPU as on CPU. But the results seem wrong: 10% fewer tracks are found than without the PR.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542

       

      Other topics:

       

      EPN GPU Topics:

      • AMD cannot deliver MI210 or newer samples, but Volker has some spare MI210s in Frankfurt, which he can send.
      • To be inserted into the EPN farm, together with 1 MI50 and 1 MI100 as second dev-server with EPN setup. (https://its.cern.ch/jira/browse/EPN-572)
    • 10:20 AM – 10:25 AM
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Average size of looper when charge overlap starts

      I.e., when do we start to find looper legs as tracks rather than just mangled clusters?

      • Average pad-width [cm] of clusters vs. row number

      • Use clusters between 60 < row < 100 as average: <sigma_pad> ~ 0.4cm
      • Assume overlap starts at <3*sigma_pad separation: 2.4cm; cluster height between 60 < row < 100: ~1.1cm
      • Need 10 clusters for tracking looper legs
        Geometric conditions:
        • Charge mangling: sin(30°) = 2.4cm / R
        • Min. NCl for tracking: sin(70°) = 5*1.1cm / R
          • R_{looper legs}^{min} = max(condition 1, condition 2) = 5.9cm = 5.3 pads in height = 9.8 pads in width
      • This coincides well with one of the plots from my thesis

      • Corresponding pT: pT_{looper legs}^{min} = 8.9 MeV/c

       

      Receiver Operating Characteristic (ROC) and Area under the Curve (AUC)

      • 3D > 2D, but no significant difference between all 3D networks
      • A classifier with >90% AUC is typically categorized as excellent, >85% as very good

       

      RTX5080 vs RTX3090

      • Forget the memory usage for now...
      • NNs are approx. 2.1-2.4x faster on the modern GPU in both FP32 and FP16

    • 10:25 AM – 10:30 AM
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      • Started a performance study of the standalone benchmark on different GPUs
      • Selected datasets:
        • Real PbPb data (lhc24_ar)
        • Simulated PbPb 47kHz data
        • Real pp data?
      • Fixed O2 release
      • For each GPU
        • For each dataset
          • Tune parameters for that release
          • Measure sync wall time with RTC
          • Measure async wall time without RTC

      GPU            Real PbPb   Sim PbPb   Real pp
      NGT H100       Done        Done       ToDo
      NGT RadeonPro  Done        Done       ToDo
      EPN MI50       ToDo        ToDo       ToDo
      EPN MI100      ToDo        ToDo       ToDo

       

      • JSON parametrization works ✅
      • Changing the JSON triggers regeneration of the header
      • The main header is generated for the detected GPU (or a specified one)
      • Per-architecture headers (AMPERE, TURING, MI100, VEGA) are generated along with the .par files (in the same dir)
        2026-01-13@16:48:27:DEBUG:O2:O2:0: [4/11] Generating GPU parameter header for AMPERE
        2026-01-13@16:48:27:DEBUG:O2:O2:0: -- Generated /home/gcimador/alice/sw/BUILD/968...a67/O2/GPU/GPUTracking/genGPUArch/GPUDefParametersDefaults_AMPERE.h
      • Need to test for correctness
      • Need to write a CSV-to-JSON converter
    • 10:30 AM – 10:35 AM
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      NextGenTrigger Task 1.7

      • Implementing our SoA code in MadGraph, together with
        • Stefan Roiser
        • Daniele Massaro
      • We had two meetings to solve a few specific problems.

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended the CI pipeline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
      • Switch between AoS and SoA is now just one flag.
      • AoS with our new data layout abstraction has zero overhead compared to the old code
      • Made some changes to SoA/AoS code to make it more user-friendly:
        • To SoA-ify a class, the class now inherits from a CRTP base class...
        • ... and defines a boilerplate function that applies a generic function to all members
        • The CRTP base class uses this boilerplate function.
      • Next Steps:
        • Make better use of SoA to improve performance
    • 10:35 AM – 10:40 AM
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 AM – 10:45 AM
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)

      OpenCL

      No news.

      GPU Servers

      Dev machine: running at P2. Currently bare bones setup with CUDA and ROCm.

      Reachable as pdp-dev00 from the EPNs. Giarda is working on the Slurm integration.

      CI machine: Missing cables arrived yesterday.

      Highly Ionizing Particles

      Almost finished the vectorized loading of charges in the Pad Filter kernel (4 or 8 charges at once).

    • 10:45 AM – 10:50 AM
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      News from ITS GPU seeding vertexer (Gabriele)

      • Found that ATLAS uses a Gaussian seeding vertexer with a grid approach to find the maximum of the vertex function
      • --> This motivates continuing in this direction
      • Implemented a CPU version
      • The algorithm uses tracks made from the first three layers
      • Uses a Taylor expansion at the innermost cluster to generalize the track to a line
      • Creates a grid around the beampipe to look for the maximum of the vertex function
      • Need to adjust the resolution
      • Need to test the accuracy of the vertex function with MC data
      • If the quality of the PV reconstruction is good, need to think of a way to select the positions
      • Also need to decide whether to evaluate the whole grid or only the region around the pairs
    • 10:50 AM – 10:55 AM
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)