Alice Weekly Meeting: Software for Hardware Accelerators
-
-
10:00
→
10:20
Discussion 20mSpeaker: David Rohr (CERN)
Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for pp workflow is now the TPC time series, which is to slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- New GPU architecture selection for async in O2DPG looks good, should be merged.
- Test with GPU GRID jobs at NERSC pending.
- Asked DPG to run first test with ITS tracking on GPU on EPNs.
GPU ROCm / compiler topics:
- Issues that disappeared but not yet understood: random server reboot with alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when DMA engine turned off, MI100 stalling with ROCm 5.5.
- Problem with building ONNXRuntime with MigraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- LLVM Bump to 20.1: status?
- Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
- Compilation errors fixed in ROCm trunk, to go into ROCm 7.
- Validated standalone benchmark in deterministic mode, and ran FST.
- Performance regression of ~1.5% without RTC, no regression with RTC. Reported to AMD. Not clear if this will be fixed.
- Serialization bug on MI100 still not fixed, still requires the workaround.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of monte carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
- Final solution: merging transformation maps on the fly into a single flat object:
- Compilation issues on GPU fixed.
- Matthias did some checks, found two minor bugs to be fixed by Sergey.
- Speed seems ok, 0.25s for merging 2 maps.
- Pending OpenCL2 issues:
- printf not working due to confirmed bug in clang, fix is being prepared. Prevents further debugging for now.
- Crash in merger, which can be worked around by disabling clang SPIRV optimization. Probably bug in clang, but need to fix printf first to debug.
- Also with optimization disabled, crashing later in TPC merging, need printf to debug.
- printf not working due to confirmed bug in clang, fix is being prepared. Prevents further debugging for now.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. WIP PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- Now has general improvements in all aspects: higher efficiency, lower clone rate, more clusters rejected online, better cluster to track association, faster on AMD GPUs (though slightly slower on NVIDIA and CPU).
- Trying to understand one problem found by Ruben where it gets stuck processing a TF, then we might merge it to benefit from the improevements.
- Further developments in different branch on top.
- Bug in multi-threaded pipeline when timeframes do not arrive in order. Happened again, Ernst and Giulio obtained a stack trace and attached a gdb. Not fully clear why, but the completion policy is stuck. Tentative fix for a possible deadlook between the completion policy and DPL is here https://github.com/AliceO2Group/AliceO2/pull/14640. Need to test online if the problem goes away.
- gpu-reconstruction quitting with error in some async jobs due to running out of buffers, need to check but probably bad TPC data.
EPN GPU Topics:
-
10:20
→
10:25
TPC ML Clustering 5mSpeaker: Christian Sonnabend (CERN, Heidelberg University (DE))
Online runs - summary
- Tested multiple configurations
- Almost all failed
- Identified reasons (out-of-bound writes due to more thread-spawning than requested), fixed and PR merged yesterday: https://github.com/AliceO2Group/AliceO2/pull/14657
Physics: ITS-TPC matching
- First investigation on its-tpc debug workflow was a bit inconclusive
- Second investigation on debug trees leads to more results:
A = Simulation with CF regression, B = simulation with regression network (should show matching losses) - Shown: ITS-TPC matched tracks found in one simulation but not in the other. The plotted parameters will then be from the first simulation.
(selected by the same isPhysicalPrimary as the histograms that show the losses)
Reminder: Major losses observed in 0.2 < | q/pT | < 2, ...
... as found in the debug trees:
Z:
sin(phi):
tan(lambda):
Reason why I didn't see them in the eta distribution: Window of the plot was too small.
nClustersTPC:
So many of the lost tracks are at NCL -> 60. How about the ones with NCL > 100?
Let's have a look at the inverse situation: ITS-TPC matches found in B but not in A:
-> NCL > 100 similar here (so maybe its just an MC label mismatch: {track ID, event ID, source ID})
Finally, the loss is found:
So the matching efficiency loss is now understood and comes from the high inclination region. Q: Anything to do about it?
There is also no (significant) loss in TPC efficiency at high eta, independent of the threshold:
-> Conclusion: This must be an effect of changed track parameters.
-
10:25
→
10:30
GPU Parameter Optimizations 5mSpeaker: Gabriele Cimador (CERN, Università and INFN Torino)
News from GPU parameter tuning
Changes after the mismatch fix
- Tuned PbPb 50 KHz after fixing the HIP/CUDA mismatch
- Lowered grid sizes to maximum of 600 (kept max 900 only for some small kernels)
- Lowered number of trials
- New constraint in the optimization function:
if min_block_per_cu * max_threads / 256 > 10:
return float("inf")
Penalize configurations that would make the second parameter > 10 - In this way, avoid kernels with no register spillage (since the compiler ignores min_active_warps > 10) but with high grid size
- Forces the algorithm to focus on small but good configurations
-
Best parameters perf [ms] Before mismatch fix After mismatch fix + constraints Clusterizer step 732.18 757.02 CompressionKernels_step1unattached 523.69 524.31 GMMergerCollect 93.86 94.40 GMMergerSectorRefit 284.58 267.72 GMMergerTrackFit 761.78 704.26 FollowLoopers + CompressionKernels_step0attached 508.43 480.44 TrackletConstructor step 1091.24 > 4000
For the tracklet constructor step, since it is highly dimensional, it was more probable to fall into the condition
min_block_per_cu * max_threads / 256 > 10
, leading to many inf values and few acceptable values. Removing the constraint lead to an optimal solution of 1128.05.Version Mean sync time [ms] Default - before mismatch fix 4679.83 ms ± 3.64 ms Default - after mismatch fix 4682.61 ms ± 5.77 ms Tuned - before mismatch fix 4317.97 ms ± 4.17 ms (7,73%) Tuned - after mismatch fix + extra constraints 4315.11 ms ± 3.06 ms (7,79%)
Reduced grid sizes in best parameters. Some examples:Kernel Best conf now Best conf before GMMergerCollect 1024x120 768x600 GMMergerSectorRefit 64x480 448x840 GMMergerTrackFit 64x360 64x900
CompressionKernels_step0attached 64x180 192x60
GMMergerFollowLoopers 128x240 64x900
NVIDIA GPUs tuning
Got access to the NGT cluster. Started to adapt the tuner for Nvidia GPUs.
ToDos:
- Investigate why big grid sizes with low occupancy are preferred rather than the small grid sizes with same occupancy
- Decide if how to integrate new default parameters in O2
ALICE contribution to HS23
- Created repository for the Dockerfile: https://github.com/cima22/O2GPU_HS23
- Waiting reply by Robin for testing
-
10:30
→
10:35
Efficient Data Structures 5mSpeaker: Dr Oliver Gregor Rietmann (CERN)
-
10:35
→
10:40
Following up GPU to-dos 5mSpeaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
Working with the TODOs:-
A Compiler directive (GCC diagnostic need to be removed in GPU/GPUTracking/Base/hip/GPUReconstructionHIPIncludesSystem.h class,
but checked that the entire class not needed. If I remove the class then also, no make error.
In this class there are few includes only from HIP and Thrust. So we may remove it fully.
-
10:40
→
10:45
TPC Clusterization / OpenCL / Highly Ionizing Particles 5mSpeaker: Felix Weiglhofer (Goethe University Frankfurt (DE))
OpenCL
crash when compiling ZS decoder
- Submitted reproducer
- Last update from PoCL devs: Not a trivial fix, requires rewriting one step in compiler pipeline
Memory corruption in ZS decoder
- Looks like miscompilation after all
- Issue disappears when running single-thread decoder via OpenCL
- Kernel consists of two parts
- Decode header region in page into shared memory
-
- Decode adc data from page in global memory
- Replacing header decoding with single-threaded variant: Issue almost disappears, but some wrong values still written
- Dumping written values with printf:
- OpenCL version writes correct values except for memory corruption
- So compared to CPU version: expect identical output + bunch of additional writes
- Actual output: OpenCL writes less values than CPU + almost no overlap in output???
- -> printf affected by miscompilation, so useless here?
New GPU server
Final configuration:
- Dev machine with 64 core threadripper + RTX 5080 (& MI50)
- New 2U server for alibi, repurpose old alibi to CI GPU machine
Order with Guy. Waiting for update.
-
10:45
→
10:50
ITS Tracking 5mSpeakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
-
10:50
→
10:55
System Run Coordination Topics 5mSpeaker: Ernst Hellbar (CERN)
-
10:00
→
10:20