Alice Weekly Meeting: Software for Hardware Accelerators

Timezone: Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 10:00 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: critical, news from this week: blue, news from last week: purple, no news: black

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • New GPU architecture selection for async in O2DPG looks good, should be merged.
      • Test with GPU GRID jobs at NERSC pending.
      • Asked DPG to run first test with ITS tracking on GPU on EPNs.

       

      GPU ROCm / compiler topics:

      • Issues that disappeared but are not yet understood: random server reboot with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when the DMA engine is turned off, MI100 stalling with ROCm 5.5.
      • Problem with building ONNXRuntime with MIGraphX support.
      • Need to find a way to build ONNXRuntime with support for both CUDA and ROCm.
      • Try to find a better solution for the problem of __device__ inline functions leaking symbols into the host code (see the sketch after this list).
      • LLVM Bump to 20.1: status?
      • Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
        • ROCm 7 with full serialization passes validation in deterministic mode. Received instructions from AMD on how to build the current compiler so we can check the performance.
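
      A minimal sketch (hypothetical macro and function names, not the actual O2 definitions) of one way the __device__ inline symbol leak mentioned above can arise: on the host compilation pass the device attribute compiles away, and the function body is then emitted as a weak host symbol in every translation unit that includes the header.

        // shared_header.h -- hypothetical names sketching the leak mechanism
        #if defined(__CUDACC__) || defined(__HIPCC__)
        #define GPU_DEVICE_INLINE __device__ inline // device compilation pass
        #else
        #define GPU_DEVICE_INLINE inline // host pass: the attribute is gone, but the
                                         // body is still emitted as a weak host
                                         // symbol in every including translation unit
        #endif

        GPU_DEVICE_INLINE float squared(float x) { return x * x; }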

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels (see the sketch after this list): https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix the bogus TPC transformations for good; then we can revert the workaround.
      • Waiting for TPC to check the PR that uses full cluster errors, including average-charge and occupancy-map errors, during seeding.
      • Final solution: merging transformation maps on the fly into a single flat object. A draft version by Sergey exists but is still WIP.
      • Pending OpenCL2 issues:
        • printf not working due to a confirmed bug in clang; a fix is being prepared. This prevents further debugging for now.
        • Crash in the merger, which can be worked around by disabling the clang SPIR-V optimization. Probably a bug in clang, but printf must be fixed first before this can be debugged.
        • Even with the optimization disabled, it crashes later in TPC merging; printf is needed to debug this as well.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. WIP PR: https://github.com/AliceO2Group/AliceO2/pull/14542
      • Probably a bug in the multi-threaded pipeline when timeframes do not arrive in order. Cannot reproduce it: in my test with non-monotonic timeframe order, the pipeline works well. Not following up further for now.
      • Bug in reading of MC data, temporary fix by Ruben applied. Implemented a proper fix.
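
      A minimal sketch, with a hypothetical stand-in type, of the alignas() approach from the first item of this list: enforce the required alignment in the type itself instead of relying on the surrounding buffer layout.

        #include <cstdint>

        // Hypothetical stand-in for the MC label type; alignas() guarantees
        // that elements, and arrays of them, start on a 4-byte boundary.
        struct alignas(4) MCLabelSketch {
          int32_t trackID;
          int16_t eventID;
          uint8_t sourceID; // trailing 1-byte member; alignas keeps sizeof a
                            // multiple of 4, so consecutive elements stay aligned
        };

        static_assert(alignof(MCLabelSketch) == 4 && sizeof(MCLabelSketch) % 4 == 0,
                      "layout assumed by the (hypothetical) GPU-side reader");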

       

      EPN GPU Topics:

       

    • 10:20 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Mostly thesis writing now, but I have some questions / points for discussion:

      ITS-TPC matching efficiency

      Context:

      • TPC clusters and tracks are created with given algorithmic settings from the plot label
      • ITS tracks are kept the same -> They come from the previous reconstruction

       

      -> For high-pT (close to the center of the plot), the matching efficiency is systematically higher for the algorithms with the GPU CF regression.
      Since these are primaries, I would expect most high-pT particles to be at central eta. But against eta I see a reversed trend: there the matching efficiencies are the same as or only slightly higher than with the GPU CF (see below).

      I then tested if this is a matter of the threshold setting, but it is not:

      This is puzzling to me. Two options come to my mind:

      1. Is the ITS-TPC alignment done with TPC tracks or with cosmics?
      2. Can the ITS tracks change with a different TPC reconstruction (i.e. different number of TPC tracks, different track parameters)?

       

      As a ratio to the TPC tracks, this efficiency improves with the new algorithm.

       

      ---------------

      And since I have never actually shown it in this meeting: The width estimation of qTot using the different algorithms

    • 10:25 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      Still fighting with the launch-bounds fix (a generic sketch of what the launch-bounds parameters control follows this list):

      • Need to update dumpGPUDefParam:

        void dumpGPUDefParam(const char* outputfile = "parameters.out")
        {
          // Load the current default parameters and print them for inspection.
          auto param = o2::gpu::internal::GPUDefParametersLoad();
          printf("Loaded params:\n%s\nWriting them to %s\n", o2::gpu::internal::GPUDefParametersExport(param, false).c_str(), outputfile);
          // Raw binary dump of the parameter struct.
          FILE* fp = fopen(outputfile, "w+b");
          if (fp != nullptr) {
            fwrite(&param, 1, sizeof(param), fp);
            fclose(fp);
          }
        }

        Proposed solution: define the WARP size in the header file, add minBlockFactor to the call to GPUDefParametersExport, and find a way to dump that export.

      • RTC fails with @David's patch:
         29797 | extern "C" {__attribute__((global)) void GPUCA_KRNL_REG_DEFAULT((GPUCA_LB_GPUTPCNeighboursFinder)) krnl_GPUTPCNeighboursFinder( int32_t _iSector_internal ) { __attribute__((shared)) typename GPUTPCNeighboursFinder::GPUSharedMemory smem; GPUTPCNeighboursFinder::template Thread<GPUTPCNeighboursFinder::defaultKernel>((gridDim.x), (blockDim.x), (blockIdx.x), (threadIdx.x), smem, GPUTPCNeighboursFinder::Processor((gGPUConstantMemBuffer.v))[_iSector_internal] ); }}

        The macro fails to expand; I suspect it is not defined at that point and am investigating.
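
      For context, a generic CUDA/HIP sketch of what the launch-bounds parameters control: the cap on threads per block and the minimum number of resident blocks per multiprocessor, which the compiler uses to budget registers. This is an assumption about the shape of what GPUCA_KRNL_REG / GPUCA_LB_* expand to, not their actual definitions.

        #include <cstdint>

        // Generic CUDA/HIP launch-bounds usage; the kernel name is hypothetical.
        __global__ void __launch_bounds__(256 /* max threads per block */,
                                          2 /* min resident blocks per SM/CU */)
          exampleKernel(float* data, int32_t n)
        {
          int32_t i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) {
            data[i] *= 2.f; // register usage is capped so that 2 blocks of up
                            // to 256 threads can be resident at the same time
          }
        }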

       

      ALICE contribution to HS23

      • Completed build.sh. To build, just run
        docker run -v /cvmfs:/cvmfs alice_gpu_hs23 --backend HIP --arch gfx908
      • Need to test build for multiple archs
      • Warnings:

        CMake Warning (dev) at /opt/rocm/lib/cmake/hip/hip-config-amd.cmake:91 (message):
           AMDGPU_TARGETS was not set, and system GPU detection was unsuccsesful.

           The amdgpu-arch tool failed:
           Error: 'Failed to get device count'
           Output: ''

           As a result, --offload-arch will not be set for subsuqent
           compilations, and the default architecture
           (gfx906 for dynamic build / gfx942 for static build) will be used

        Call Stack (most recent call first):
          /opt/rocm/lib/cmake/hip/hip-config.cmake:149 (include)
          /alice_hs23/O2/dependencies/FindO2GPU.cmake:258 (find_package)
          CMakeLists.txt:130 (find_package)
        This warning is for project developers.  Use -Wno-dev to suppress it.

        But then:

        -- Building GPUTracking with HIP support (GPU Target gfx908)
        -- Using optimized HIP settings for MI100 GPU

        And --offload-arch=gfx908 does appear in the actual compilation commands

      • First draft of run.sh will use a dataset within the container
      • When Robin is back from vacation, we will test on their machines
    • 10:30 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      ALICE O2 CI-Pipelines on NGT Cluster

      • A fork of the AliceO2 repo is now in the NextGenTriggers (NGT) GitHub organization
      • It has a GitHub action running the standalone benchmark on NGT GPUs
      • Uses the builds in /cvmfs/alice.cern.ch/ of O2 and dependencies (pipeline takes about 7 minutes)
      • Different GPUs are tested in parallel on different VMs
      • O2 standalone benchmark works on all NGT resources:
        • Nvidia H100 188GB NVL
        • AMD Instinct MI300X
        • AMD Radeon Pro W7900
        • Nvidia L40S
      • We are now using custom .par files
      • Next steps
        • Generate optimized .par files with Gabriele
      • Possible next steps
        • Add the new architectures to the ones hardcoded in O2?
        • Store results in CSV format and plot them?
        • Display the plot in the GitHub web GUI?

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo
      • Simplified the AoS-SoA wrapper code (a minimal sketch of the idea follows this list)
      • Started to apply our AoS-SoA code to: 
        • Merger
          • GPUTPCGMSectorTrack
          • GPUTPCGMTrackParam 
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTrack
      • Started with SectorTracker, no problems until now
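
      A minimal sketch, with hypothetical names (C++17), of the AoS-SoA wrapper idea referenced above: client code accesses members through one interface while the storage layout is selected by a template parameter.

        #include <cstddef>
        #include <vector>

        enum class Layout { AoS, SoA };

        template <Layout L>
        struct TrackStore; // storage layout selected at compile time

        template <>
        struct TrackStore<Layout::AoS> {
          struct Track { float x, y, z; };
          std::vector<Track> tracks;
          float& x(std::size_t i) { return tracks[i].x; } // same accessor...
        };

        template <>
        struct TrackStore<Layout::SoA> {
          std::vector<float> xs, ys, zs;
          float& x(std::size_t i) { return xs[i]; } // ...independent of layout
        };

        // Client code is layout-agnostic: the same loop works for both layouts.
        template <Layout L>
        void shiftX(TrackStore<L>& store, std::size_t n, float dx)
        {
          for (std::size_t i = 0; i < n; ++i) {
            store.x(i) += dx;
          }
        }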

      Preparation for ACAT Conference (8. - 12. September)

      • Presenting our joint work on AoS/SoA together with Jolly
        • Jolly: SoA using reflection (C++26)
        • Oliver: SoA using templates (C++17)
      • We will present benchmark results comparing the different AoS/SoA implementations
      • Fixed some issues with an experimental compiler that supports reflection
      • Problems:
        • Example where AoS is faster than SoA on CPU? (one candidate case is sketched after this list)
        • Sometimes loops over SoA are not vectorized with the new experimental reflection compiler
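
      On the first open question above, one candidate case (an assumption for illustration, not a measured result): random indexed access that touches all fields of an element tends to favor AoS, because each element sits in a single cache line while the SoA version touches one line per field.

        #include <cstddef>
        #include <vector>

        struct P { float x, y, z, w; }; // 16 bytes: whole element in one cache line

        float sumRandomAoS(const std::vector<P>& v, const std::vector<std::size_t>& idx)
        {
          float s = 0.f;
          for (std::size_t i : idx) { // random order defeats hardware prefetching
            s += v[i].x + v[i].y + v[i].z + v[i].w; // one cache line per element
          }
          return s;
        }

        float sumRandomSoA(const std::vector<float>& x, const std::vector<float>& y,
                           const std::vector<float>& z, const std::vector<float>& w,
                           const std::vector<std::size_t>& idx)
        {
          float s = 0.f;
          for (std::size_t i : idx) { // the same accesses now hit 4 distinct lines
            s += x[i] + y[i] + z[i] + w[i];
          }
          return s;
        }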
    • 10:35 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))

      OpenCL

      Resolved issues:

      • crash when compiling stream compaction kernel
        • Fixed upstream
        • Fun use case for LLMs: struggled for multiple days to reproduce this in a standalone kernel. Gave Claude Opus the kernel source, the disassembled SPIR-V, and the source of the PoCL function that crashes -> a few seconds later I had a working reproducer...

       

      Open issues with PoCL:

      • crash when compiling ZS decoder:
        • Submitted reproducer, investigated by PoCL devs
      • Memory corruption in ZS decoder
    • 10:45 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Still at the HCP school, then on vacation for 2 weeks; no updates except for what Gabriele will report. (-Felix)

      ITS-GPU-tracker parameter tuning (Gabriele)

      • Adjusted the tuner for the ITS tracking
      • Need to evaluate the results
      • Objective function: output from the ITS tracker (no profiler)
        • Seems a bit unreliable; need to investigate whether the profiler is needed
    • 10:50 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)