Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:00 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • Test with GPU GRID jobs at NERSC pending.
        • Updated default builds to include the A100 GPU architecture, and compute capability 7.5 (virtual) as the lowest target for CUDA JIT compilation, so all CUDA devices from 7.5 onwards should now be supported.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Miscompilation / internal compiler error fixed in the new clang for ROCm 7.x; the SDMA engine synchronization bug is still not fixed.
        • Serialization bug pending.
        • Miscompilation on MI100 leading to a memory error pending.
        • New miscompilation on MI50 with ROCm 7.0 when RTC is disabled.
        • New miscompilation on MI50 on ROCm 6.3 and 7.0 when RTC is enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
        • Waiting for ROCm 7.2, which could fix the MI100 serialization issue for good. Not clear yet with regard to the miscompilation problems.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Rebased the current PR, CI is green now, and the GPU gives the same results as the CPU. But the results seem wrong: finding 10% fewer tracks than without the PR.
        • Sergey provided a fix for the time-to-z inverse conversion, but it is still not fully working: now finding 2% fewer tracks than without the PR.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542

       

      Other topics:

      • GPU CI Server: Cannot power the GPUs right now, because the voltages / pinout of the mainboard connector for the PCIe cable do not match the cables we have, even though the physical connector is the same and both are HP cables; plugging them in would basically create a short. Thank you, HP...
      • Need to find other cables, or build our own cable.

       

      EPN GPU Topics:

      • AMD cannot deliver MI210 or newer samples, but Volker has some spare MI210s in Frankfurt, which he can send.
      • To be inserted into the EPN farm, together with 1 MI50 and 1 MI100 as second dev-server with EPN setup. (https://its.cern.ch/jira/browse/EPN-572)
        • MI210 GPU on the way to CERN, Ivan should bring it end of this week.
    • 10:20 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      GPU model speeds comparison

      1. Using prebuilt wheels, the Python installation of the MIGraphX execution provider is usable on AMD GPUs
      2. With access to the NGT cluster hardware (huge thanks to Oliver!), the benchmark was extended to modern GPUs

       

      1000 evaluations of batches of size 262144, input size per element (3,9,9); 10 warmup evaluations (excluded from the measurement). Colors are normalised per column.

      Unfortunately, model compilation for the CNN on older AMD GPUs is not supported (blank spots in the table).

      Overall fastest for our deployment case: MI300X, second place goes to H100. MI300X outperforms H100 by a factor of ~2.

      FP32: the CPU (32 threads) is still a factor ~30 slower than even the slowest GPU

    • 10:25 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      • Started performance study on standalone-benchmark on different GPUs
      • Selected datasets:
        • Real PbPb data (lhc24_ar)
        • Simulated PbPb 47kHz data
        • Real pp data?
      • Fixed O2 release
      • For each GPU
        • For each dataset
          • Tune parameters for that release
          • Measure sync wall time with RTC
          • Measure async wall time without RTC

      GPU            Real PbPb   Sim PbPb   Real pp
      NGT H100       Done        Done       ToDo
      NGT RadeonPro  Done        Done       ToDo
      EPN MI50       ToDo        ToDo       ToDo
      EPN MI100      ToDo        ToDo       ToDo

       

      • JSON GPU params ready ✅
      • Generation triggered by changes to the JSON
      • Defaults header generated into ${CMAKE_BINARY_DIR}/GPU/GPUTracking/GPUDefParametersDefaults.h
      • MI100, VEGA, AMPERE, TURING headers generated alongside the .par files
        DEBUG:O2:O2:0: [5/11] Generating GPU parameter header for AMPERE
        DEBUG:O2:O2:0: -- Generated /home/gcimador/alice/sw/BUILD/968b1d99620a47d7c1f36416d4d12717459dba67/O2/GPU/GPUTracking/genGPUArch/GPUDefParametersDefaults_AMPERE.h
      • Todo:
        • Correctness tests
        • CSVtoJSON script
    • 10:30 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      NextGenTrigger Task 1.7

      • Implementing our SoA code in MadGraph, together with
        • Stefan Roiser
        • Daniele Massaro
      • Used the NGT cluster and its GPUs for Christian's benchmarks (2x Nvidia, 2x AMD)
      • Will get next update on Thursday (January 15)

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended the CI pipeline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
      • Switching between AoS and SoA is now just one flag.
      • AoS with our new data-layout abstraction has zero overhead compared to the old code.
      • Made some changes to the SoA/AoS code to make it more user-friendly.
      • Next Steps:
        • Make better use of SoA to improve performance
        • Try David's suggestion
    • 10:35 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)
    • 10:45 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      News from GPU ITS seeding vertexer (Gabriele)

      What I am doing: looking for a new seeding vertexing algorithm

      Why:

      • Current one is serial (cannot be ported to GPUs)
      • Quality of the seeds needs to be improved in order to use the per-vertex tracking which reduces memory usage

      What I want to do:

      • Compute tracks of the first three ITS layers
      • Linearize the track at the innermost cluster, to work with straight lines near the beam line
      • Define a vertex function, which returns a score for each point in 3D space. To calculate the score at a point:
        • For each track, compute a score based on the distance of the track to the point.
        • Sum the contributions of all tracks at that point (a second term suppresses contributions from single tracks).


      • Points with highest score will indicate the possible position of a vertex

      • Ideal for GPU: each thread takes a point and computes its score. No branches, and the same number of iterations (the number of tracks) for each thread
      • ATLAS did more or less this, but on CPU

      What I did so far:

      • CPU version of the vertex function

      Transverse plane of 10 random tracks + linear prolongation from a ReadOutFrame (subset of a TimeFrame):

      zy plane of the same tracks. Beamline highlighted in red. Tracks "stack" at some values of z, indicating vertex candidates

      Vertex function at z = 3.151 cm on the beam line (maximum around (0,0)):

      ToDo:

      • Check the quality of the vertex function with MC data
      • If quality of vertex reco is ok, understand how to select the maximum values of the function

       

       

      Felix:

      Improved after clearing, but a small-ish discrepancy still remains (not clear to me why yet)

      In any case, ITS tracking ran for the 25 apass1 on GPU :)

    • 10:50 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)