ALICE Weekly Meeting: Software for Hardware Accelerators
-
10:00 → 10:20
Discussion (20m). Speaker: David Rohr (CERN)
Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
- New GPU architecture selection for async in O2DPG looks good, should be merged.
- Test with GPU GRID jobs at NERSC pending.
- Asked DPG to run first test with ITS tracking on GPU on EPNs.
GPU ROCm / compiler topics:
- Issues that disappeared but are not yet understood: random server reboot with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when the DMA engine is turned off, MI100 stalling with ROCm 5.5.
- Problem with building ONNXRuntime with MIGraphX support.
- Need to find a way to build ONNXRuntime with support for both CUDA and ROCm (see the provider-selection sketch after this list).
- Try to find a better solution for the problem of __device__ inline functions leaking symbols into the host code (see the linkage sketch after this list).
- LLVM Bump to 20.1: status?
- Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
- ROCm 7 with full serialization passes validation in deterministic mode. Received instructions from AMD on how to build the current compiler so we can check the performance.
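
On the ONNXRuntime item above: the goal is one binary that can pick CUDA or ROCm at run time. A hedged sketch of what provider selection could look like with the stock ONNXRuntime C++ API, assuming a build that actually ships both execution providers (which is exactly the open build problem):

    #include <onnxruntime_cxx_api.h>

    // Sketch only: select the execution provider at run time, assuming a
    // single ONNXRuntime build that contains both the CUDA and the ROCm EP.
    Ort::Session makeSession(Ort::Env& env, const char* modelPath, bool useRocm)
    {
      Ort::SessionOptions opts;
      if (useRocm) {
        OrtROCMProviderOptions rocm{}; // defaults to device 0
        opts.AppendExecutionProvider_ROCM(rocm);
      } else {
        OrtCUDAProviderOptions cuda{}; // defaults to device 0
        opts.AppendExecutionProvider_CUDA(cuda);
      }
      return Ort::Session(env, modelPath, opts);
    }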
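
On the __device__ inline symbol leakage: a common mitigation is to give such header-defined functions internal linkage, so host objects no longer export them; whether that fits our macro machinery is part of what needs evaluating. A minimal illustration with hypothetical names and a simplified GPUd()-like macro:

    // Simplified stand-in for the O2 host/device macros (illustrative only):
    #if defined(__CUDACC__) || defined(__HIPCC__)
    #define GPUdi() __device__ inline
    #else
    #define GPUdi() inline
    #endif

    // Plain 'inline' has external linkage: every translation unit including
    // the header emits a weak symbol that remains visible in the host binary.
    GPUdi() float weightedCharge(float q, float w) { return q * w; }

    // 'static' (or an anonymous namespace) switches to internal linkage, so
    // the symbol no longer leaks, at the cost of one copy per translation
    // unit and the usual ODR caveats.
    static GPUdi() float weightedChargeStatic(float q, float w) { return q * w; }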
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels (see the sketch after this list): https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
- Final solution: merging the transformation maps on the fly into a single flat object (see the flat-map sketch after this list). A draft version by Sergey exists but is still WIP.
- Pending OpenCL2 issues:
  - printf not working due to a confirmed bug in clang; a fix is being prepared. This prevents further debugging for now.
  - Crash in the merger, which can be worked around by disabling the clang SPIR-V optimization. Probably a bug in clang, but printf must be fixed first to debug it.
  - Even with the optimization disabled, it crashes later in TPC merging; printf is needed to debug.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. WIP PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- Probably a bug in the multi-threaded pipeline when timeframes do not arrive in order. Cannot reproduce it: in my test with a non-monotonic timeframe order, the pipeline works fine. Not following up further for now.
- Bug in the reading of MC data; a temporary fix by Ruben was applied. Implemented a proper fix.
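
On the alignas() item above (O2-5314): a hedged sketch of the idea with hypothetical names, not the actual O2 label types. Forcing the alignment on the type makes misaligned GPU loads a property of the buffer layout that can be checked, rather than a run-time accident:

    #include <cstdint>

    // Force the label type's alignment so that packing it into raw shared
    // buffers cannot produce misaligned accesses on the GPU.
    struct alignas(8) MCLabel {
      uint64_t packed; // source / event / track id packed into one word
    };

    static_assert(alignof(MCLabel) == 8, "GPU code assumes 8-byte aligned labels");

    // Any buffer offset handed to the GPU can be checked against this:
    inline bool isProperlyAligned(const void* p)
    {
      return reinterpret_cast<std::uintptr_t>(p) % alignof(MCLabel) == 0;
    }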
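
On the flat-object item above: a hedged 1D sketch of what merging transformation maps on the fly could look like (illustrative, not Sergey's actual draft). The composition of the maps is evaluated once on a grid, and the hot path then does a single lookup in one flat array:

    #include <array>
    #include <cstddef>

    template <std::size_t N>
    struct FlatMap1D {
      float xMin, xMax;
      std::array<float, N> value; // merged correction at each grid node

      // Compose the individual maps once, offline.
      template <typename F, typename G>
      void fill(F first, G second)
      {
        for (std::size_t i = 0; i < N; ++i) {
          float x = xMin + (xMax - xMin) * i / (N - 1);
          value[i] = second(first(x));
        }
      }

      // Hot path: one clamped nearest-node lookup instead of re-evaluating
      // the full chain of transformations.
      float eval(float x) const
      {
        float t = (x - xMin) / (xMax - xMin) * (N - 1);
        std::size_t i = t <= 0 ? 0 : (t >= N - 1 ? N - 1 : static_cast<std::size_t>(t + 0.5f));
        return value[i];
      }
    };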
EPN GPU Topics:
-
10:20 → 10:25
TPC ML Clustering (5m). Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
Mostly thesis writing now, but I have some questions / points for discussion.
ITS-TPC matching efficiency
Context:
- TPC clusters and tracks are created with given algorithmic settings from the plot label
- ITS tracks are kept the same -> They come from the previous reconstruction

-> For high-pT (close to the center of the plot), the matching efficiency is systematically higher for the algorithms with the GPU CF regression.
Since these are primaries, I would expect most high-pT particles to be at central eta. But against eta I see a reversed trend: the matching efficiencies here are the same or slightly higher than with the GPU CF (see below).
I then tested whether this is a matter of the threshold setting, but it is not:

This is puzzling to me. Two options come to my mind:
- Is the ITS-TPC alignment done with TPC tracks or with cosmics?
- Can the ITS tracks change with a different TPC reconstruction (i.e. different number of TPC tracks, different track parameters)?
As a ratio to the TPC tracks, this efficiency improves with the new algorithm

---------------
And since I have never actually shown it in this meeting: The width estimation of qTot using the different algorithms

-
10:25 → 10:30
GPU Parameter Optimizations (5m). Speaker: Gabriele Cimador (CERN, Università and INFN Torino)
News from GPU parameter tuning
Still fighting with the launch bounds fix:
- Need to update dumpGPUDefParam:

    void dumpGPUDefParam(const char* outputfile = "parameters.out")
    {
      auto param = o2::gpu::internal::GPUDefParametersLoad();
      printf("Loaded params:\n%s\nWriting them to %s\n",
             o2::gpu::internal::GPUDefParametersExport(param, false).c_str(), outputfile);
      FILE* fp = fopen(outputfile, "w+b");
      fwrite(&param, 1, sizeof(param), fp);
      fclose(fp);
    }
- Proposed solution: define the WARP size in the header file, add minBlockFactor to the call to GPUDefParametersExport, and dump that export somehow.
- RTC fails with @David's patch:
29797 | extern "C" {__attribute__((global)) void GPUCA_KRNL_REG_DEFAULT((GPUCA_LB_GPUTPCNeighboursFinder)) krnl_GPUTPCNeighboursFinder( int32_t _iSector_internal ) { __attribute__((shared)) typename GPUTPCNeighboursFinder::GPUSharedMemory smem; GPUTPCNeighboursFinder::template Thread<GPUTPCNeighboursFinder::defaultKernel>((gridDim.x), (blockDim.x), (blockIdx.x), (threadIdx.x), smem, GPUTPCNeighboursFinder::Processor((gGPUConstantMemBuffer.v))[_iSector_internal] ); }}
It fails to expand the macro; I suspect it is not defined there. I am investigating.
ALICE contribution to HS23
- Completed build.sh. To build, just run: docker run -v /cvmfs:/cvmfs alice_gpu_hs23 --backend HIP --arch gfx908
- Need to test the build for multiple archs
- Warnings:
    CMake Warning (dev) at /opt/rocm/lib/cmake/hip/hip-config-amd.cmake:91 (message):
      AMDGPU_TARGETS was not set, and system GPU detection was unsuccsesful.
      The amdgpu-arch tool failed:
      Error: 'Failed to get device count'
      Output: ''
      As a result, --offload-arch will not be set for subsuqent compilations,
      and the default architecture (gfx906 for dynamic build / gfx942 for static build) will be used
    Call Stack (most recent call first):
      /opt/rocm/lib/cmake/hip/hip-config.cmake:149 (include)
      /alice_hs23/O2/dependencies/FindO2GPU.cmake:258 (find_package)
      CMakeLists.txt:130 (find_package)
    This warning is for project developers. Use -Wno-dev to suppress it.

But then:
    -- Building GPUTracking with HIP support (GPU Target gfx908)
    -- Using optimized HIP settings for MI100 GPU

And in the actual compilation commands, --offload-arch=gfx908 appears.
- First draft of run.sh will use the dataset within the container
- When Robin is back from vacation, we will test on their machines
-
10:30 → 10:35
Efficient Data Structures (5m). Speaker: Dr Oliver Gregor Rietmann (CERN)
ALICE O2 CI-Pipelines on NGT Cluster
- A fork of the AliceO2 repo is now in the NextGenTriggers (NGT) GitHub organization
- It has a GitHub action running the standalone benchmark on NGT GPUs
- Uses the builds of O2 and its dependencies in /cvmfs/alice.cern.ch/ (the pipeline takes about 7 minutes)
- Different GPUs are tested in parallel on different VMs
- The O2 standalone benchmark works on all NGT resources:
- Nvidia H100 188GB NVL
- AMD Instinct MI300X
- AMD Radeon Pro W7900
- Nvidia L40S
- We are now using custom .par files
- Next steps
- Generate optimized .par files with Gabriele
- Possible next steps
- Add the new architectures to the ones hardcoded in O2?
- Store results in csv format and plot them?
- Display the plot in the GitHub web GUI?
Implement NGT SoA Code in O2 standalone benchmark
- Working on this fork of the AliceO2 repo
- Simplified the AoS-SoA wrapper code
- Started to apply our AoS-SoA code (a minimal sketch of the template approach follows below) to:
  - Merger
    - GPUTPCGMSectorTrack
    - GPUTPCGMTrackParam
  - SectorTracker
    - GPUTPCBaseTrackParam
    - GPUTPCTrackParam
    - GPUTPCTrack
- Started with the SectorTracker, no problems so far
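
As referenced in the list above, a minimal C++17 flavour of the template-based AoS-SoA idea (illustrative only; the actual NGT wrapper differs in detail). Each field lives in its own contiguous array, and a lightweight reference proxy restores AoS-style element access:

    #include <cstddef>
    #include <vector>

    struct TrackParamSoA {
      std::vector<float> x, y, tgl; // one contiguous array per field

      // Reference proxy: stitches one logical element back together so call
      // sites keep AoS-style syntax while the storage stays SoA.
      struct Ref {
        float &x, &y, &tgl;
      };
      Ref operator[](std::size_t i) { return {x[i], y[i], tgl[i]}; }

      void resize(std::size_t n)
      {
        x.resize(n);
        y.resize(n);
        tgl.resize(n);
      }
    };

    // Usage: tracks[i].x reads like the AoS member access it replaces.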
Preparation for the ACAT Conference (8-12 September)
- Presenting our joint work on AoS/SoA together with Jolly
- Jolly: SoA using reflection (C++26)
- Oliver: SoA using templates (C++17)
- We will present the benchmark results comparing different AoS/SoA implementations
- Fixed some issues with an experimental compiler that supports reflection
- Problems:
- Example where AoS is faster than SoA on CPU? (one candidate case is sketched below)
- Sometimes loops over SoA are not vectorized with the new experimental reflection compiler
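
On the first problem: one candidate case where AoS beats SoA on the CPU is random whole-element access. With AoS all fields of an element share a cache line; with SoA the same logical element costs roughly one cache miss per field array. A sketch of the access pattern (an assumption to benchmark, not a measured result):

    #include <cstddef>
    #include <vector>

    struct PointAoS { float x, y, z; };

    float sumRandomElements(const std::vector<PointAoS>& pts,
                            const std::vector<std::size_t>& idx) // random permutation
    {
      float s = 0.f;
      for (std::size_t i : idx) {
        const PointAoS& p = pts[i]; // one cache line fetches x, y and z together
        s += p.x + p.y + p.z;
      }
      return s;
    }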
-
10:35 → 10:40
Following up GPU to-dos (5m). Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
-
10:40 → 10:45
TPC Clusterization / OpenCL / Highly Ionizing Particles (5m). Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))
OpenCL
Resolved issues:
- Crash when compiling the stream compaction kernel
  - Fixed upstream
  - Fun use case for LLMs: I struggled for multiple days to reproduce this in a standalone kernel. Gave Claude Opus the kernel source, the disassembled SPIR-V, and the source of the PoCL function that crashes -> a few seconds later I had a working reproducer...
Open issues with PoCL:
- Crash when compiling the ZS decoder:
  - Submitted a reproducer; being investigated by the PoCL devs
- Memory corruption in the ZS decoder
-
10:45 → 10:50
ITS Tracking (5m). Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
Still at the HCP school, then on vacation for 2 weeks; no updates except for what Gabriele will do. (-Felix)
ITS-GPU-tracker parameter tuning (Gabriele)
- Adjusted the tuner for the ITS tracking
- Need to evaluate the results
- Objective function: output from the ITS tracker (no profiler)
- Seems a bit unreliable; need to investigate whether the profiler is needed
-
10:50 → 10:55
System Run Coordination Topics (5m). Speaker: Ernst Hellbar (CERN)