Alice Weekly Meeting: Software for Hardware Accelerators
-
10:00 → 10:20  Discussion (20m)  Speaker: David Rohr (CERN)
Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
CHEP Abstracts: https://indico.cern.ch/event/1471803/abstracts/ Deadline Dec. 19.
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
- New GPU architecture selection for async in O2DPG looks good, should be merged.
- Test with GPU GRID jobs at NERSC pending.
- Asked DPG to run first test with ITS tracking on GPU on EPNs.
GPU ROCm / compiler topics:
- Issues that disappeared but not yet understood: random server reboot with alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when DMA engine turned off, MI100 stalling with ROCm 5.5.
- Problem with building ONNXRuntime with MIGraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem of __device__ inline functions leaking symbols into the host code (see the sketch after this list).
- LLVM Bump to 20.1: merged
- Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
- Compilation errors fixed in ROCm trunk, to go into ROCm 7.
- Validated standalone benchmark in deterministic mode, and ran FST.
- Performance regression of ~1.5% without RTC, no regression with RTC. Reported to AMD. Not clear if this will be fixed.
- Serialization bug on MI100 still not fixed, still requires the workaround.
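A minimal sketch of the symbol-leakage issue mentioned above, outside the actual O2 GPU framework: when a GPU qualifier macro expands to nothing in the host pass, a header-defined inline device helper becomes an ordinary inline host function, and every host translation unit including the header can emit a weak (COMDAT) symbol for it. All names below are illustrative, not the real O2 macros.

// gpu_inline_sketch.h -- hypothetical header included by host and device TUs
#if defined(__HIPCC__) || defined(__CUDACC__)
#define GPU_DEVICE __device__
#else
#define GPU_DEVICE  // host pass: the device qualifier disappears
#endif

// Host TUs see a plain "inline" function here, so a weak symbol for it can
// end up in the host objects ("leaking" device-only code into host builds).
GPU_DEVICE inline float weightedCharge(float q, float w) { return q * w; }

// One possible direction for a better solution: keep the host-side fallback
// internal to each TU (anonymous namespace or static), so no external symbol
// escapes the host objects; whether this fits the O2 build model is exactly
// what still needs to be investigated.
namespace {
GPU_DEVICE inline float weightedChargeLocal(float q, float w) { return q * w; }
}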
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Waiting for TPC to check the PR that uses full cluster errors, including average-charge and occupancy-map errors, during seeding.
- Final solution: merging transformation maps on the fly into a single flat object:
- Compilation issues on GPU fixed.
- Matthias did some checks, found two minor bugs to be fixed by Sergey.
- Speed seems ok, 0.25s for merging 2 maps.
- Pending OpenCL2 issues:
- printf not working due to confirmed bug in clang, fix is being prepared. Prevents further debugging for now.
- Crash in merger, which can be worked around by disabling clang SPIRV optimization. Probably bug in clang, but need to fix printf first to debug.
- Also with optimization disabled, crashing later in TPC merging, need printf to debug.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- PR merged, further developments in https://github.com/AliceO2Group/AliceO2/pull/14651.
- Bug in multi-threaded pipeline when timeframes do not arrive in order. Did not happen again after https://github.com/AliceO2Group/AliceO2/pull/14640 was merged, assuming it is fixed.
- gpu-reconstruction quitting with an error in some async jobs due to running out of buffers: the problem was that TPC sector A11 (the broken one) has far fewer clusters. Together with Ruben's settings for aggressive extrapolation into adjacent sectors, this led to too many sector tracks in sector A11, and gpu-reconstruction aborted because the estimated memory size was exceeded. Fixed by taking the adjacent-sector occupancy into account in the buffer-size estimation (a minimal sketch of the idea follows after this list).
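A rough sketch of the buffer-size idea described in the last item above: with aggressive extrapolation into adjacent sectors enabled, a sector with very few own clusters (like the broken A11) can still accumulate many tracks from its neighbours, so the per-sector track-buffer estimate should include the neighbours' occupancy. Function and parameter names are illustrative, not the actual gpu-reconstruction code, and the neighbour indexing is simplified.

#include <array>
#include <cstddef>

constexpr std::size_t kNSectors = 36;  // TPC sectors (A and C side combined)

// Estimate the track buffer for one sector from its own cluster count plus a
// fraction of the clusters of the two azimuthal neighbours that aggressive
// extrapolation can pull in.
std::size_t estimateSectorTrackBuffer(const std::array<std::size_t, kNSectors>& nClusters,
                                      std::size_t sector,
                                      float tracksPerCluster,    // tuning constant
                                      float neighbourFraction)   // tuning constant
{
  const std::size_t left  = (sector + kNSectors - 1) % kNSectors;
  const std::size_t right = (sector + 1) % kNSectors;
  const float effectiveClusters =
      static_cast<float>(nClusters[sector]) +
      neighbourFraction * static_cast<float>(nClusters[left] + nClusters[right]);
  return static_cast<std::size_t>(effectiveClusters * tracksPerCluster) + 1;
}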
EPN GPU Topics:
-
10:20 → 10:25  TPC ML Clustering (5m)  Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
Real data reconstruction
- Used LHC24ar, apass2
- Full chain done for the following configurations:
- GPU CF (default reco)
- NN 0.03, CF regression
- NN 0.03, full
- NN 0.05, full
- NN 0.1, full
PID calibration
- Successful for all periods, quality maintained
- With NN PID calibration
- K0S and Lambda mass peak
- Ratio of track distribution from 200 TFs
- Reduction until max. 300 MeV, as expected from MC
- NCL distribution of tracks from 200 TFs
- Reduction mainly observed for very short tracks
- chi2/NCL (analysis variable) from 200 TFs
- Improvement noticeable for all cases where NN regression is used
- Separation of pion and electron band from selected V0s
- Showing one figure as example (NN, 0.05)
Commissioning runs
- It ran online (03.10.2025) 🥳🥳🥳
- Noise is significantly higher in real data than in MC
- Adjustment needed for qTot threshold
- Criterion: Adjust until number of tracks and number of clusters roughly match default reco, without using any classification by the NN
- Bonus: Is the current qTot threshold a good choice for the default reconstruction?
- Spoiler: Good for pp but could be loosened for Pb-Pb
- Chosen threshold for online run: qTot ≥ 8
- Actual commissioning runs
- 566696: NN, full configuration, threshold 0.05
- 566697: NN classification, threshold 0.05 + heuristic regression
- Unfortunate mistake: 566697 also has qTot ≥ 8, even though it uses the heuristic regression
- Previous investigation: Makes a difference of 2.8% clusters and 0.7% tracks (pp, 1 MHz)
- Realistic check on CTF size: more like 1.4% clusters, because the run was pp at 500 kHz
- Data-size reduction: GB / lumi
- Shows expected behavior of 9.4% reduction in total data volume
- Electron-pion separation power at MIP, not V0 but full distribution
- Left: NN, Right: Default reco from another run
- Improves both electrons (1.25%) and pions (5%). The effect is not as strong as in Pb-Pb and not as strong as on the V0 sample (because of potential surrounding noise)
- Separation power also improves, by 1.3% (the usual definition is recalled after this list)
- Shared cluster distribution
- In any case not expected to be critically dominated by shared clusters in low-rate pp, but still good to check
- First: Check absolute counts
- A small relative improvement close to the bin at 0 will have the most dominant effect (by orders of magnitude)
- Expected behavior: Increase peak close to 0 (relative to reference run), decrease across the rest of shared clusters
- To be plotted as a ratio to the reference run, by strength of the relative contribution to the absolute track / cluster count
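For context on the separation-power numbers above: assuming the usual TPC dE/dx definition is meant (if the slides use a different one, that one applies), the electron-pion separation power is

S_{e\pi} = \frac{\left| \langle \mathrm{d}E/\mathrm{d}x \rangle_{e} - \langle \mathrm{d}E/\mathrm{d}x \rangle_{\pi} \right|}{\tfrac{1}{2}\left(\sigma_{e} + \sigma_{\pi}\right)}

i.e. the distance between the electron and pion dE/dx peaks in units of the average of their resolutions.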
Something at a higher level: reconstruction of D0
Used a BDT from Fabrizio Grosa to investigate two different reconstructions: default and NN (full) with threshold 0.05
- Wrote an Optuna-based optimization class that optimises the input variables with respect to a given score metric
- Chosen score metric:
- w1 = 0.2, w2 = 0.8
- signal: gauss, background: pol3 (tried also exp, both work well)
- Result (same cuts applied to both datasets):
- First, default reco: signal = 32.0, sigma = 0.0265, gauss peak = 4.82e+02
- Second, NN full, 0.05: signal = 26.4, sigma = 0.0235, gauss peak = 4.47e+02
So: the default reco is still "better", but the BDT from the default reco was applied in the same way, with the same cuts, to both reconstructions.
After talking to Fabrizio Grosa: the excess in background and signal could actually be real. This heavily depends on ITS-TPC matching, DCA, and other observables; not necessarily anything wrong in the reco.
Finally, also something cool:
3D render of one of the first collisions recorded with the neural network cluster finder in online reco! Full credit to Felix Schlepper
-
10:25 → 10:30  GPU Parameter Optimizations (5m)  Speaker: Gabriele Cimador (CERN, Università and INFN Torino)
-
10:30 → 10:35  Efficient Data Structures (5m)  Speaker: Dr Oliver Gregor Rietmann (CERN)
-
10:35 → 10:40
-
10:40 → 10:45  TPC Clusterization / OpenCL / Highly Ionizing Particles (5m)  Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))
-
10:45 → 10:50  ITS Tracking (5m)  Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
Felix:
Since last time, added a 'feature' to the cpu-allocator to use the pinned memory from the gpu-framework; now all data transferred to the GPU is prepared in pinned host memory and transferred via DMA (https://github.com/AliceO2Group/AliceO2/pull/14681). Previously, we allocated via the system allocator and pinned (and later unpinned) the memory ourselves for every TF. I did not measure the impact on the timing, but it should be obvious that the new approach is better (a rough sketch of the two approaches follows below).
Waiting for a recipe to clear parts of the memory in between iterations. Then repeat the test production.
What else is needed to commission ITS GPU tracking for async?
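A rough sketch of the two transfer strategies described above, using plain HIP runtime calls (the CUDA equivalents are analogous); the function names and the way the pinned buffer is obtained are illustrative, not the actual cpu-allocator / gpu-framework interfaces.

#include <hip/hip_runtime.h>
#include <cstring>
#include <vector>

// Old approach: system allocation, then pin and unpin the memory for every TF.
void transferWithPerTFRegister(const std::vector<char>& tfData, void* devPtr, hipStream_t stream)
{
  std::vector<char> staging(tfData);  // ordinary (pageable) system allocation
  hipHostRegister(staging.data(), staging.size(), hipHostRegisterDefault);  // pin
  hipMemcpyAsync(devPtr, staging.data(), staging.size(), hipMemcpyHostToDevice, stream);
  hipStreamSynchronize(stream);
  hipHostUnregister(staging.data());  // unpin again, paid for every TF
}

// New approach: prepare the data directly in pinned host memory (here a fresh
// hipHostMalloc; in O2 it would come from the gpu-framework's pool), so the
// copy can go via DMA without per-TF pin/unpin overhead.
void transferFromPinnedBuffer(const char* tfData, std::size_t size, void* devPtr, hipStream_t stream)
{
  void* pinned = nullptr;
  hipHostMalloc(&pinned, size, hipHostMallocDefault);
  std::memcpy(pinned, tfData, size);  // data is prepared in pinned memory
  hipMemcpyAsync(devPtr, pinned, size, hipMemcpyHostToDevice, stream);
  hipStreamSynchronize(stream);
  hipHostFree(pinned);
}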
News from ITS vertexing (Gabriele)
- Partial porting of the vertex seeding by Felix
- Will port rest of the algorithm
- After that, I would like to start an optimization campaign for the vertexing:
- Is this the best way to split the work among GPU threads?
- Is this the best way to deliver the data to GPU?
- Is this the best algorithm we can use? Can we find a more GPU-friendly one (such that the CPU version is also optimized and determinism is not broken)?
-
10:50 → 10:55  System Run Coordination Topics (5m)  Speaker: Ernst Hellbar (CERN)
-