Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:00 AM 10:20 AM
      Discussion 20m
      Speaker: David Michael Rohr (CERN)

      Color code: (critical news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • Test with GPU GRID jobs at NERSC pending.
      • Will tune the existing 16-core settings and add a SITEARCH for 16-core CPU, and for 16-core CPU + generic NVIDIA / AMD GPU, as done for 8 cores.
      • Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Tested ROCm 7.2 on MI50 / MI100 / MI210. Runs stably on MI50 / MI210, not yet checked for correctness. Crashes randomly on the MI100, but the pattern seems different from the serialization bug we had before.
      • Need to understand and fix crash on RTX Pro 6000.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix alignment of monte carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Maps now yield correct results, but there is a 1.5x performance regression when running on GPUs; must be investigated.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
      • TODO: workaround for the wrong field used for encoding online; make memory scaling factors configurable via ConfigurableParam.

       

      Other topics:

      • Need to bump ONNXRuntime to 1.24 (needed for ROCm 7.2); Giulio is checking. Status?
      • Status of bumping CMake and boost (https://github.com/alisw/alidist/pull/6135):
        • Required adapting / bumping ~30 packages; now nearly done.
        • Remaining issues: one problem in O2 (only on Mac) and one in O2Physics (wrong boost usage); PRs with fixes are open.
        • Need new DD tag, PR open.
        • Problem with the new libwebsocket on RHEL7 due to bogus kernel headers in that version colliding with glibc. Must either switch the AliRoot CI to SLC9 and drop SLC7 support, or disable IPv6 for SLC7.

       

      EPN GPU Topics:

       

    • 10:20 AM 10:25 AM
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Cluster errors

      • General observations
        • NLL (negative log-likelihood) loss works better for convergence than MSE (mean-square error) loss
        • Scaling necessary (requires tuning of a parameter; currently only one parameter, but optimally two separate ones)
        • Training now rather stable and testing different configurations
          • Also tested the sigma / sqrt(qTot) idea. Worked decently well and reaches a similar number of tracks, but efficiency is down by 10-15%.
            • Feeding in both cluster and track position completely deteriorated the fit; no long tracks found.
            • Getting rather good results now for long tracks, even though the total number of tracks is still not as high as with the default method -> need to adjust the scaling parameter.

       

      • Next try: Retuning x and y scaling separately
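      The NLL-vs-MSE observation above can be illustrated with a minimal numpy sketch, assuming a toy Gaussian error model (all names and numbers here are hypothetical, not from the actual training setup). The key difference is that the NLL lets the network predict a per-cluster uncertainty sigma alongside the position, which down-weights residuals on genuinely noisy clusters:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: penalizes all residuals uniformly."""
    return np.mean((y_true - y_pred) ** 2)

def gaussian_nll_loss(y_true, mu, sigma):
    """Negative log-likelihood of a Gaussian: the model predicts both the
    cluster position mu and its uncertainty sigma, so a large residual on
    a noisy cluster is down-weighted by a large predicted sigma."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + 0.5 * ((y_true - mu) / sigma) ** 2)

# Toy example: two clusters, the second intrinsically noisier.
y_true = np.array([0.0, 0.0])
mu     = np.array([0.1, 1.0])   # predicted positions
sigma  = np.array([0.2, 1.5])   # predicted per-cluster errors

print(mse_loss(y_true, mu))
print(gaussian_nll_loss(y_true, mu, sigma))
```

      A well-calibrated sigma lowers the NLL; an over-confident (too small) sigma raises it sharply, which is what drives the convergence behavior mentioned above.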

       

       

    • 10:25 AM 10:30 AM
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      No recent news

      Step / Kernel                           MI50 32GB [ms]   Radeon VII [ms]   Speedup
      clusterizer step                              685.15           740.56        1.08
      mergerCollect                                  70.74            68.61        1.03
      mergerTrackFit                                640.83          1452.46        2.26
      tracklet step                                1139.16          1305.79        1.15
      compression unattached                        643.22           645.79        1.00
      mergerSectorRefit                             303.76           628.68        2.07
      FollowLoopers + compression attached         1116.16          1077.43        1.03
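      As a sanity check of the table, the Speedup column appears to be the ratio of the slower to the faster device per step (a reading inferred from the numbers, not stated explicitly in the notes):

```python
# Consistency check of the timing table: speedup = max(t1, t2) / min(t1, t2),
# i.e. the slowdown of whichever device is slower on that step.
rows = {
    "clusterizer step":                     (685.15, 740.56, 1.08),
    "mergerCollect":                        (70.74, 68.61, 1.03),
    "mergerTrackFit":                       (640.83, 1452.46, 2.26),
    "tracklet step":                        (1139.16, 1305.79, 1.15),
    "compression unattached":               (643.22, 645.79, 1.00),
    "mergerSectorRefit":                    (303.76, 628.68, 2.07),
    "FollowLoopers + compression attached": (1116.16, 1077.43, 1.03),
}
for name, (mi50, radeon, quoted) in rows.items():
    ratio = max(mi50, radeon) / min(mi50, radeon)
    assert abs(ratio - quoted) < 0.01, name  # matches to within rounding
```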
    • 10:30 AM 10:35 AM
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      NextGenTrigger Task 1.7

      • CHEP talk together with Jolly got accepted

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended the CI pipeline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
          • GPUTPCTrack
      • The SoA library changed a lot in the course of implementing it in O2; it has now been adapted accordingly.
      • Next Steps:
        • Make better use of SoA to improve performance
        • Try David's suggestion

      Comparison to other SoA approaches

      Issues with AMD W7900 after updating to latest O2

      1. Providing a .par file with certain parameters results in non-deterministic behavior, despite deterministic mode.
      2. Cannot create my own .par files with dumpGPUDefParam.C of the newest O2; yields an "incompatible launch bounds" error.
      3. The only parameters that worked on the W7900 are the ones for the MI100 with WARP_SIZE changed to 32.
      4. Other parameters (e.g. RDNA) yield deterministic behavior, but with a (slightly) different GPU.out than obtained with other GPUs.

      Other issues

      1. In my setting, I couldn't just add new GPUs to GPUParameters.csv: The headers of the newly added GPUs were created, but not the corresponding .par files.
      2. A particular build dependency of O2 is missing on CVMFS: ninja/fortran-v1.11.1.g9-16 (only .g9-15 is present). I am in touch with Sergio.
      3. In GPUParameters.csv, the AMD GPUs RDNA and MI210 are missing, but they show up in FindO2GPU.cmake. Is this intended?

       

    • 10:35 AM 10:40 AM
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 AM 10:45 AM
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)

      OpenCL

      No news.

      GPU Servers

      Waiting for the PCIe 8-pin cable; have not heard from Guy yet.

      Highly Ionizing Particles

      Investigated the crash on NVIDIA (happened even when not triggered): FIXED.

      Tail masking working with injected tails.

      Next steps: 

      • Test with simulated TF
      • Add flag to toggle tail masking

       

    • 10:45 AM 10:50 AM
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Felix: worked on porting the ITS2 staggered-readout tracking to the GPU code (not done yet; there is a crash I still have to understand, but the CPU version works).

       

      Gabriele:

      • Talked with Max Puccio
      • Restricting the computation of the vertex function to V(0,0,z) might be problematic due to a possible shift of the beam line
      • Currently implemented a 3D version:
        • For each voxel centre, the vertex function is computed
        • Then a peak finder is run to find peaks and obtain vertex candidates
        • Currently tuning parameters with Optuna
        • Precision: 1.0000, Recall: 0.7133; let's see where the plateau is
      • The 3D version is too expensive anyway (~16 seconds per PbPb TF)
      • If tuning gives good results, I will try to find a way to preselect tracks per voxel to restrict computations
      • Will try another implementation with 3d histogramming
        • Basically where the track contributes only for bins where it passes through
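      The voxel-scan idea can be sketched as follows, assuming straight-line toy tracks and a Gaussian per-track contribution (all geometry and parameters here are hypothetical; a real peak finder would search for local maxima to get multiple vertex candidates, rather than a single argmax):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
true_vtx = np.array([0.0, 0.0, 2.0])

# Toy tracks: straight lines through the true vertex with random directions.
tracks = []
for _ in range(20):
    d = rng.normal(size=3)
    d /= np.linalg.norm(d)
    tracks.append((true_vtx.copy(), d))

def vertex_function(x, sigma=0.1):
    """Sum of Gaussian track contributions at voxel centre x: each track
    contributes according to its perpendicular distance to the voxel."""
    total = 0.0
    for p, d in tracks:
        v = x - p
        dist = np.linalg.norm(v - np.dot(v, d) * d)  # point-to-line distance
        total += np.exp(-0.5 * (dist / sigma) ** 2)
    return total

# Evaluate the vertex function on a coarse voxel grid around the beam line,
# then take the highest voxel as the vertex candidate.
xs = ys = np.linspace(-0.5, 0.5, 5)
zs = np.linspace(-10.0, 10.0, 41)
best_val, best_c = -1.0, None
for cx, cy, cz in itertools.product(xs, ys, zs):
    val = vertex_function(np.array([cx, cy, cz]))
    if val > best_val:
        best_val, best_c = val, (cx, cy, cz)
print(best_c)  # close to the true vertex (0, 0, 2)
```

      The cost scales as (voxels x tracks), which is why preselecting tracks per voxel, or histogramming so a track only fills the bins it passes through, cuts the work substantially.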
    • 10:50 AM 10:55 AM
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)

      P2 tests with replay pp 750 kHz

      • MI100 fully serialized
      • O2_GPU_RTC=0

       

      gpu-reco TF processing rates:

      MI50 - 1.97 Hz
      MI100 - 1.74 Hz

      • theoretically (using simple scaling) sufficient for 1374 kHz with the full farm (270 MI50s, 65 MI100s), but not for the TPC high-rate tests for FC stability
        • can simply use the 2025 PbPb software for the high-rate tests
          • can enable the GPU RTC; we expect it to be fast enough for 50 kHz PbPb when disabling looper following

       

       

      P2 tests with replay PbPb 50 kHz

      • MI100 fully serialized
      • O2_GPU_RTC=0

       

      gpu-reco TF processing rates:

      MI50 - 0.64 Hz
      MI100 - 0.78 Hz

      • slowdown of 28% on MI50 and 23% on MI100 compared to the 2025 PbPb SW version (with O2_GPU_RTC=1)
      • theoretically sufficient for 30 kHz with full farm (270 MI50s, 65 MI100s)
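      The "simple scaling" above can be reproduced with back-of-envelope arithmetic, assuming a total TF rate of ~352 Hz (32-orbit TFs; this rate is an assumption, not stated in the notes) and linear scaling of per-node throughput with interaction rate:

```python
# Farm capacity estimate from the per-node gpu-reco TF rates quoted above.
# ASSUMPTION: ~352 Hz total TF rate (32-orbit TFs at the LHC orbit period);
# throughput is assumed to scale linearly with interaction rate.
N_MI50, N_MI100 = 270, 65
TF_RATE_HZ = 352.0

def max_interaction_rate(khz_replayed, mi50_hz, mi100_hz):
    """Highest interaction rate the full farm could keep up with."""
    farm_hz = N_MI50 * mi50_hz + N_MI100 * mi100_hz
    return khz_replayed * farm_hz / TF_RATE_HZ

print(max_interaction_rate(750, 1.97, 1.74))  # pp:   ~1374 kHz, as quoted
print(max_interaction_rate(50, 0.64, 0.78))   # PbPb: ~32 kHz, quoted as ~30 kHz
```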