Alice Weekly Meeting: Software for Hardware Accelerators
-
10:00 AM → 10:20 AM
Discussion (20m)
Speaker: David Rohr (CERN)
Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- Test with GPU GRID jobs at NERSC pending.
- Will tune existing 16-core settings, add a SITEARCH for 16-core CPU and for 16-core CPU + generic NVIDIA / AMD GPU, as for 8-core.
- Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
GPU ROCm / compiler topics:
- Problem with building ONNXRuntime with MigraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Need to check ROCm 7.2 correctness.
- Need to understand and fix crash on RTX Pro 6000.
- Understand deterministic mode issue on NVIDIA Blackwell.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Final solution: merging transformation maps on the fly into a single flat object:
- Maps now yielding correct results, but 1.5x performance regression running on GPUs.
- Could not find the problem yet (also did not have much time so far). I cannot bisect it, since intermediate commits do not run on GPU. Probably have to rewrite the PR and split it into steps that can run on GPU, to see what triggers the slowdown.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- TPC decoding can now handle data that was encoded with the wrong B field; the memory scaling factor parameters are configurable via --configKeyValue.
Other topics:
- Need to bump ONNXRuntime to 1.24, Giulio is checking, needed for ROCm 7.2 - Status?
- Status of bumping CMake and boost (https://github.com/alisw/alidist/pull/6135):
- CI fully green now. Need to check on EPNs, and have to do a special build for analyzers to check, in particular the bump of cgal.
- Test at NERSC still ongoing, all jobs so far failed for non-GPU-related reasons (issues with the farm, no free slots for several days, wrong JDL, old JAlien version, ...)
EPN GPU Topics:
-
10:20 AM → 10:25 AM
TPC ML Clustering (5m)
Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
Cluster error tuning
- Automated search (choose half widths around optimum and submit next iteration)
- Adjusted search metric: Correctly attached non-fake * average efficiency (primaries)
- Efficiency improved slightly
- Adjusted network input: All cluster flags, invCharge, invAvgCharge, mult
- That didn't change much...
- Added new scaling parameters for chi cuts in tracking

- First search: scaleChiY1, scaleChiZ1, nnScaleClusterErrorY, nnScaleClusterErrorZ



- Next step: Need new (more realistic) simulation
- Tighten cuts on training data further to have a better cluster error estimate
-
10:25 AM → 10:30 AM
GPU Parameter Optimizations (5m)
Speaker: Gabriele Cimador (CERN, Università and INFN Torino)
GPU Parameters news
Tuned MI210 without RTC
Event 0: 4.24 s --> 3.15 s (-25.7 %)
Event 1: 4.60 s --> 3.70 s (-19.6 %)
Event 2: 4.77 s --> 3.77 s (-21.0 %)
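The percentages above are consistent with the relative change in per-event processing time, (after - before) / before. A minimal Python sketch to recompute them from the quoted timings; the function and variable names are only illustrative and not part of any tuning tool:

```python
def relative_change_percent(before_s, after_s):
    """Relative change of a timing in percent (negative means faster)."""
    return (after_s - before_s) / before_s * 100.0

# Per-event timings in seconds (before, after), copied from the MI210 results above.
timings = {0: (4.24, 3.15), 1: (4.60, 3.70), 2: (4.77, 3.77)}

# Round to one decimal place, matching the precision quoted in the minutes.
changes = {
    event: round(relative_change_percent(before, after), 1)
    for event, (before, after) in timings.items()
}

for event, change in changes.items():
    print(f"Event {event}: {change:+.1f} %")
```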
-
10:30 AM → 10:35 AM
Efficient Data Structures (5m)
Speaker: Dr Oliver Gregor Rietmann (CERN)
NextGenTrigger Task 1.7
- CHEP talk together with Jolly got accepted.
- Presented a "proof of concept" to improve the CMS clustering algorithm CLUE to Felice and NGT.
- They liked it and they want to work it out.
Implement NGT SoA Code in O2 standalone benchmark
- Working on this fork of the AliceO2 repo, with a CI pipeline:
- Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
- Extended CI pipeline to fail if GPU.out changes
- Implemented SoA in:
- SectorTracker
- GPUTPCBaseTrackParam
- GPUTPCTrackParam
- GPUTPCTracklet
- GPUTPCTrack
- SectorTracker
- Worked on pull-request for O2.
- Tried also Gabriele's benchmarking tool.
- See appended CSV file or below for a comparison to standalone benchmark.
- Made it part of CI-pipeline that benchmarks on NGT GPUs.
- Next Steps:
- Write minimal reproducer for the (weird) behavior I observed with AMD W7900 and custom .par file.
- Make better use of SoA to improve performance
- Try David's suggestion
Comparison Standalone / rocprofv2
name                        time (standalone)   time (rocprofv2)   relative difference (|standalone - rocprofv2| / standalone)
GPUTPCNeighboursCleaner     4952.70             3891.72            0.2142
GPUTPCNeighboursFinder      37099.65            35538.75           0.0421
GPUTPCStartHitsFinder       1509.44             725.10             0.5196
GPUTPCStartHitsSorter       3609.17             2646.45            0.2667
GPUTPCTrackletConstructor   1225978.33          1224653.23         0.0011
GPUTPCTrackletSelector      36015.63            34149.26           0.0518
-
10:35 AM → 10:40 AM
Following up GPU to-dos (5m)
Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
-
10:40 AM → 10:45 AM
TPC Clusterization / OpenCL / Highly Ionizing Particles (5m)
Speaker: Felix Weiglhofer (CERN)
OpenCL
No news.
GPU Servers
CI Server: No update from Sergio yet.
Highly Ionizing Particles
Opened PR for the current implementation.
Filter Performance
Reran filter on new simulation data with Mesut's suggestions. (brackets: old values)
Good:
- False positive rate went down.
Bad:
- Still only about 40% of tails filtered.
Ugly:
- Even with a threshold of 5 ADC, filter performance doesn't change meaningfully...
Confusion matrix
Cutoff Threshold   Correctly Zeroed   Tails Leaked      Correctly Kept    False Positive    Precision (# correctly zeroed / # total zeroed)
150 ADC            0.4142 (0.3910)    0.5858 (0.6090)   0.9987 (0.9977)   0.0013 (0.0023)   0.6513 (0.5370)
50 ADC             0.4301 (0.4115)    0.5699 (0.5885)   0.9987 (0.9974)   0.0013 (0.0026)   0.6597 (0.5161)
5 ADC              0.4782             0.5218            0.9986            0.0014            0.6719
Saturated Clusters
- First Proof-Of-Concept implementation in place.
- Still need to evaluate clusters
- Find a way to parallelize implementation
-
10:45 AM → 10:50 AM
ITS Tracking (5m)
Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
Gabriele:
New implementation:
Precision: 0.9928, Recall: 0.9133, F1: 0.9514
Total true vertices: 150
Matched vertices: 137
Missed true vertices: 13
Total fitted vertices: 138
Fake vertices: 1
Mean signed residuals (x,y,z): [0.00038846 0.00010329 0.00066991]
Std signed residuals (x,y,z): [0.00853282 0.00977771 0.0052135 ]
Mean absolute residuals (x,y,z): [0.0043629 0.00525042 0.00273546]
Std absolute residuals (x,y,z): [0.00733391 0.00823679 0.00448273]
Missed true vertices:
ROF x y z nContributors
11.0 -0.000259 0.022460 -3.69227 10.0
13.0 -0.000585 0.002898 8.07200 11.0
30.0 -0.024436 -0.009381 7.03565 1.0
35.0 -0.023708 -0.009344 -5.51584 1.0
48.0 -0.010583 -0.008603 -3.05986 14.0
75.0 0.011355 -0.002970 1.90074 1.0
90.0 0.002929 -0.018336 2.66971 1.0
102.0 0.001209 0.020929 -4.71790 1.0
123.0 -0.001262 -0.002698 8.83149 22.0
128.0 0.006334 -0.001214 -5.15652 10.0
158.0 -0.022409 -0.003296 -7.26120 3.0
161.0 0.002261 0.001900 -10.54620 1.0
164.0 -0.008004 0.000373 -5.64786 8.0
Old implementation:
Precision: 1.0000, Recall: 0.9067, F1: 0.9510
Total true vertices: 150
Matched vertices: 136
Missed true vertices: 14
Total fitted vertices: 136
Fake vertices: 0
Mean signed residuals (x,y,z): [-7.04813868e-05 1.47534449e-04 3.28789706e-04]
Std signed residuals (x,y,z): [0.00157584 0.00351721 0.00311071]
Mean absolute residuals (x,y,z): [0.00092318 0.00105691 0.00067765]
Std absolute residuals (x,y,z): [0.00127659 0.00335669 0.00305332]
Missed true vertices:
ROF x y z nContributors
11.0 -0.000259 0.022460 -3.69227 10.0
30.0 -0.024436 -0.009381 7.03565 1.0
35.0 -0.023708 -0.009344 -5.51584 1.0
48.0 -0.010583 -0.008603 -3.05986 14.0
49.0 -0.011022 0.000661 -3.79249 39.0
75.0 0.011355 -0.002970 1.90074 1.0
90.0 0.002929 -0.018336 2.66971 1.0
97.0 0.011611 0.000815 0.00732 28.0
102.0 0.001209 0.020929 -4.71790 1.0
123.0 -0.001262 -0.002698 8.83149 22.0
128.0 0.006334 -0.001214 -5.15652 10.0
158.0 -0.022409 -0.003296 -7.26120 3.0
161.0 0.002261 0.001900 -10.54620 1.0
164.0 -0.008004 0.000373 -5.64786 8.0
Next steps:
- Use different trackleting and improve fitting to reduce residuals
- Attempt another approach suggested by Felix still using Ray Tracing techniques
- Move from sampling of line inside voxels to voxel traversing
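The precision, recall and F1 values quoted for both vertexer implementations are consistent with the usual definitions applied to the listed counts: precision = matched / fitted, recall = matched / true, and F1 as their harmonic mean. A minimal Python sketch to recompute them, assuming these definitions are the ones used (the function name is hypothetical and not taken from the validation code):

```python
def vertex_metrics(n_true, n_matched, n_fitted):
    """Return (precision, recall, F1) from vertex counts.

    Assumed definitions:
      precision = matched vertices / fitted vertices
      recall    = matched vertices / true vertices
      F1        = harmonic mean of precision and recall
    """
    precision = n_matched / n_fitted
    recall = n_matched / n_true
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts copied from the two summaries above.
new_impl = vertex_metrics(n_true=150, n_matched=137, n_fitted=138)
old_impl = vertex_metrics(n_true=150, n_matched=136, n_fitted=136)

for label, (p, r, f1) in (("New", new_impl), ("Old", old_impl)):
    print(f"{label} implementation: Precision: {p:.4f}, Recall: {r:.4f}, F1: {f1:.4f}")
```

Under these definitions the new implementation gains one matched vertex at the cost of one fake vertex, which is why the F1 score barely moves between the two.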
-
10:50 AM → 10:55 AM
System Run Coordination Topics (5m)
Speaker: Ernst Hellbar (CERN)