Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
Useful links
Join via phone
Zoom URL
    • 10:00 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news from this week: blue, news from last week: purple, no news: black)

      CHEP Abstracts: https://indico.cern.ch/event/1471803/abstracts/ Deadline Dec. 19.

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • New GPU architecture selection for async in O2DPG looks good, should be merged.
      • Test with GPU GRID jobs at NERSC pending.
      • Asked DPG to run first test with ITS tracking on GPU on EPNs.

       

      GPU ROCm / compiler topics:

      • Issues that disappeared but are not yet understood: random server reboot with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when the DMA engine is turned off, MI100 stalling with ROCm 5.5.
      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
        • Serialization bug pending.
        • Miscompilation on MI100 leading to a memory error pending.
        • New miscompilation on MI50 with ROCm 7.0 when RTC is disabled.
        • New miscompilation on MI50 with ROCm 6.3 and 7.0 when RTC is enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Sergey opened a new PR with the fixes and compatibility layer in, currently fails in the CI. Must be fixed, then Matthias can continue commissioning.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
        • New tracking is actually faster in the refit but slower in the looper following. Raw data has more loopers than MC, so overall it is faster in MC and slower in raw data. Can disable looper following to gain some speed at 50 kHz.
        • TPC laser calib issue fixed; it was a bug in the tracking developments affecting only triggered data.
        • Large memory usage in pp on the GRID - fixed.
        • Gain calib issue was due to new miscompilation.
        • Next iteration of tracking improvements in draft PR: https://github.com/AliceO2Group/AliceO2/pull/14651
      • Should get Gabriele's new parameters merged for Pb-Pb.

       

      EPN GPU Topics:

       

    • 10:20 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Optimization of cluster finder filling kernel

      • Current filling kernel: Same speed as one neural network evaluation
        • Inherently expensive due to memory operations
        • For benchmark: Fully load GPU (loop over filling kernel) and compare current implementation to new implementation in debugLevel=1, FST
        • Read-writes per thread:
          • Reads: 1x peak (2 x int16_t = 4 byte) + 1x chargeMap[peak] (uint16_t = 2 byte) + 9x chargeMap[position] (uint16_t = 2 byte each)
          • Writes: 9x outputCharge (FP16: float16 = 2 byte each, FP32: float = 4 bytes each)
          • Sum: FP16: 43 bytes, FP32: 61 bytes
        • Batch size: 2097152 = 2^21 (just to load the GPU fully)
        • Kernel execution count: 8640
        • Number of threads that do the dominant read/writes: 3*9 = 27

                          Time [s]   Bandwidth [GB/s]
          Old, float16      63.345              332.1
          Old, float32      60.261              495.2
          New, float16      52.519              400.6
          New, float32      51.794              576.2
      • Conclusion:
        • Reads and writes in FP32 are probably about the same speed per operation (not per byte!), and/or casting to FP16 creates overhead
        • Time for evaluation: NN, float32 ≈ 2x NN, float16

       

       

    • 10:25 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      • Tuned based on the first 3 events of the LHC24ar_raw dataset
      • Measured sync time before the launch-bounds fix commit, with RTC enabled and default parameters
      • Measured sync time after the fix commit, with RTC enabled and tuned parameters
      • Results:

       

      [Plots: Default vs Tuned, LHC24ar_raw, per event, on MI50. Shown as Gain% = (default - tuned)/default * 100, and as saved time per event.]
      Follow-up

      • Currently tuning MI100
      • Merge in O2 and test in Replay PbPb?
      • Discuss offline about the dynamic parameters implementation
    • 10:30 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      ALICE O2 CI-Pipelines on NGT Cluster

      • A fork of the AliceO2 repo is now in the NextGenTriggers (NGT) GitHub organization
      • It has a GitHub action running the standalone benchmark on NGT GPUs
      • Uses the builds in /cvmfs/alice.cern.ch/ of O2 and dependencies (pipeline takes about 7 minutes)
      • Different GPUs are tested in parallel on different VMs
      • O2 standalone benchmark works on all NGT resources:
        • Nvidia H100 188GB NVL
        • AMD Instinct MI300X
        • AMD Radeon Pro W7900
        • Nvidia L40S
      • We are now using custom .par files
      • Gabriele generated even more optimized .par files
      • Next steps
        • Use Gabriele's .par files
      • Possible next steps
        • Add the new architectures to O2's hardcoded ones?
        • Store results in CSV format and plot them?
        • Display the plot in the GitHub web GUI?

      NextGenTrigger Task 1.7

      • Workshop November 19 - 21
        • Need to prepare talk about Task 1.7.
        • Collected summaries from people working on Task 1.7 last week
        • Next Steps: 
          • Interview task leads
          • Create the slides
      • Discussing implementation of our template SoA code in MadGraph with Stefan Roiser
        • Had another discussion with Daniele Massaro about this 
        • Next Steps:
          • Help Daniele Massaro by turning the first class into SoA (beginning of November)
      • Held a meeting with Jolly and Axel from Task 1.7
        • Topic: Merge reflections and template approaches to SoA
        • Problems with template approach: Not working with derived classes or private members
        • Merged with reflections: https://godbolt.org/z/P1xohnxdv
        • Same without, but with boilerplate code: https://godbolt.org/z/ovavdKba7
        • Next Steps:
          • Implement new ideas in template approach

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo
      • Simplified the AoS-SoA wrapper code
      • Started to apply our AoS-SoA code to: 
        • Merger
          • GPUTPCGMSectorTrack
          • GPUTPCGMTrackParam 
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTrack
      • Started with SectorTracker
        • Get same results (same GPU.out files)
        • Execution time was the same =(
          • Changed more classes to SoA
          • Running with large event set (thanks to Gabriele)
          • Performance still same (or maybe 2% slower)
          • Explicit template instantiations are too verbose
      • Next Steps:
        • Make Merger use SoA too
        • Store benchmark results in another format (e.g. CSV) to plot them
        • Change CI-Pipeline so that it plots and compares to a baseline
    • 10:35 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))

      OpenCL

      crash when compiling ZS decoder

      • Submitted reproducer, no new update

      Memory corruption in ZS decoder

      • Reproducer kernel that compiles to basically identical SPIR-V
      • Currently working on host code to reproduce setup in O2

      New GPU server

      • Servers have been delivered
      • Logistics: need to organize transport to P2 for the dev machine?
      • Schedule downtime for alibi to swap GPUs and install the new server

      Other

      • Ernst discovered a crash in the ZS decoder caused by invalid TPC input
      • Improved error handling in the decoder kernel to handle these cases
    • 10:45 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      News from ITS vertexing (Gabriele)

      • Currently porting last step of vertexing

       

       

      Felix: no news

      Just some questions:

      1. How to proceed with validating the GPU tracking in pp, e.g. fixing this TF in-flight issue on the EPNs?
      2. Did you (@David) have time for this memory clearing?
    • 10:50 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)