Alice Weekly Meeting: Software for Hardware Accelerators
-
-
10:00 → 10:20
Discussion 20m
Speaker: David Rohr (CERN)
Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- Will tune existing 16-core settings and add a SITEARCH for 16-core CPU and for 16-core CPU + generic NVIDIA / AMD GPU, as for 8-core.
- Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
GPU ROCm / compiler topics:
- Problem with building ONNXRuntime with MigraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Need to check ROCm 7.2 correctness.
- Need to understand deterministic mode issue on AMD Pro 9700 reported by Oliver - Status?
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- Check for unnecessary f64 instructions in GPU code.
- PR with changes for qTot-reading / dEdx code, to support >16 bit range for saturated signals.
Other topics:
- Need to bump ONNXRuntime to 1.24, Giulio is checking, needed for ROCm 7.2 - Status?
EPN GPU Topics:
- Test of MI50-EPN equipped with 1 RTX Pro 6000 quite successful.
- GPU temperature saturates at 78°C under full load.
- At high temperature, seeing up to 2% degradation in GPU clocks and GPU performance.
- 1 RTX Pro 6000 can run at full speed in FST shared by both CPU sockets / NUMA domains.
- Right now, EPN with 1 RTX Pro 6000 Max-Q (300W TDP) reaches 5/6th of the performance of 1 8-MI50-EPN.
- (From CPU side, would need ~280 64-core EPNs, but then GPU performance should be slightly above 8*MI50).
- Perhaps 1 RTX Pro Server GPU (600W) might already be enough, 2 * RTX Pro 5000 might be another option. Almost certainly, next gaming GPU generation will be fast enough.
-
10:20 → 10:25
TPC ML Clustering 5m
Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
Cluster finder
- Successful Pb-Pb runs last Friday 🥳
- Some preliminary results from cpass0:
- 16% CTF size reduction
- NN improves separation power by ~10%!
- DCA distributions maintained
- Shared cluster map very similar
- NCl/track shows the expected reduction: Approx. 2-3 clusters for long tracks
GPU benchmarking
- PR: https://github.com/AliceO2Group/AliceO2/pull/15514
- Benchmark single GPU performance across servers (testing on epn000 and dev00 server)
- Multithreading CPU processes for ITS reduces downtime between gpu-reco processes enough to make the measurement possible
- Top: Default reco, Bottom: NN full reco. One TF of LHC24ar. TFLOOP=100
- Allowed downtime between gpu processing steps: 50ms
All detectors, default settings
- epn000


- pdp-dev00 (RTX5080), single GPU:
- [1] Downtimes too large for default reco (80-130 ms even when setting CPU process threads to 24) -> Best guess-timate
- Wall-time mean including allowed gaps: 0.390235 s
- Individual duration sample mean: 0.364311 s
- Individual duration sample sigma: 0.00405494 s
- [2]
- Wall-time mean including allowed gaps: 0.607306 s
- Individual duration sample mean: 0.587373 s
- Individual duration sample sigma: 0.0101604 s

Summary:
- Processing time increase using MI50 NN / default reco: ~1.5
- Processing time increase using RTX5080 NN / default reco (to be taken with a grain of salt): ~1.5
-> No improvement (so far) for NN using RTX5080 over MI50 (concerning hardware acceleration)
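The RTX5080 ratio quoted in the summary can be cross-checked from the pdp-dev00 numbers listed above. The short sketch below assumes that [1] is the default reco and [2] the NN full reco (the notes do not label [2] explicitly); the variable names are illustrative only.

```python
# Numbers copied from the pdp-dev00 (RTX5080) measurement in these notes.
# Assumption: [1] is the default reco and [2] is the NN full reco.
default_reco = {"wall_mean": 0.390235, "duration_mean": 0.364311}  # seconds
nn_reco = {"wall_mean": 0.607306, "duration_mean": 0.587373}       # seconds

# Processing time increase NN / default, for both quoted means.
wall_ratio = nn_reco["wall_mean"] / default_reco["wall_mean"]
duration_ratio = nn_reco["duration_mean"] / default_reco["duration_mean"]

# Average gap per TF that is included in the wall-time mean.
gap_default = default_reco["wall_mean"] - default_reco["duration_mean"]
gap_nn = nn_reco["wall_mean"] - nn_reco["duration_mean"]

print(f"NN / default (wall-time mean):     {wall_ratio:.2f}")
print(f"NN / default (individual mean):    {duration_ratio:.2f}")
print(f"mean gap default / NN [ms]:        {gap_default * 1e3:.1f} / {gap_nn * 1e3:.1f}")
```

Both ratios land between 1.5 and 1.65, in line with the ~1.5 quoted above, and the average gaps stay below the 50 ms allowed downtime.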
Measurement with "WORKFLOW_DETECTORS": "TPC,CTP"
- epn000


- pdp-dev00 (RTX5080):


Summary:
- Seems much more precise and with fewer losses than using all detectors. Can load GPU much better (far fewer TFs lost due to exclusions...)
-
10:25 → 10:30
GPU Parameter Optimizations 5m
Speaker: Gabriele Cimador (CERN, Università and INFN Torino)
PR with opt params
- PR is online
- Added two architectures to FindO2GPU.cmake
- Hopper
- MI300 (GCNA 4.0)
- Changed Ampere detection from cc 8.6 to 8.0 (why was it like this?)
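To illustrate what the Ampere threshold change does, here is a small Python sketch that mirrors the first-match threshold chain shown in these notes. The function name `cuda_target` and the `ampere_min` parameter are hypothetical and only serve the comparison; the threshold values are copied from the notes.

```python
# (minimum CUDA_FIRST_TARGET, architecture name), checked from newest to oldest.
# Values copied from the threshold chain in these notes.
THRESHOLDS = [
    (120, "BLACKWELL"),
    (90, "HOPPER"),
    (89, "ADA"),
    (80, "AMPERE"),
    (75, "TURING"),
    (60, "PASCAL"),
    (30, "KEPLER"),
    (20, "FERMI"),
]


def cuda_target(first_target, ampere_min=80):
    """Return the architecture name for an integer target such as 80 or 86.

    ampere_min is a hypothetical knob to compare the old (86) and the
    new (80) Ampere threshold.
    """
    for minimum, name in THRESHOLDS:
        if name == "AMPERE":
            minimum = ampere_min
        if first_target >= minimum:
            return name
    return "TESLA"  # fallback, as in the else() branch


print(cuda_target(80, ampere_min=86))  # old threshold
print(cuda_target(80))                 # new threshold
```

With the old threshold of 86, a device with compute capability 8.0 (for example an A100) falls through to the TURING branch; with 80 it is classified as AMPERE.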
if(CUDA_FIRST_TARGET GREATER_EQUAL 120)
  set(CUDA_TARGET BLACKWELL)
elseif(CUDA_FIRST_TARGET GREATER_EQUAL 90)
  set(CUDA_TARGET HOPPER)
elseif(CUDA_FIRST_TARGET GREATER_EQUAL 89)
  set(CUDA_TARGET ADA)
elseif(CUDA_FIRST_TARGET GREATER_EQUAL 80) # was 86 before
  set(CUDA_TARGET AMPERE)
elseif(CUDA_FIRST_TARGET GREATER_EQUAL 75)
  set(CUDA_TARGET TURING)
elseif(CUDA_FIRST_TARGET GREATER_EQUAL 60)
  set(CUDA_TARGET PASCAL)
elseif(CUDA_FIRST_TARGET GREATER_EQUAL 30)
  set(CUDA_TARGET KEPLER)
elseif(CUDA_FIRST_TARGET GREATER_EQUAL 20)
  set(CUDA_TARGET FERMI)
else()
  set(CUDA_TARGET TESLA)
endif()
Best parameters comparison - Nvidia
Kernels in single stream steps




Kernels in multi stream steps



Next steps
- Validate against manual tuning
-
10:30 → 10:35
Efficient Data Structures 5m
Speaker: Dr Oliver Gregor Rietmann (CERN)
NextGenTrigger Task 1.7
Implement NGT SoA Code in O2 standalone benchmark
- Working on this fork of the AliceO2 repo, with a CI pipeline:
- Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
- Extended CI pipeline to fail if GPU.out changes
- Implemented SoA in:
- SectorTracker
- GPUTPCBaseTrackParam
- GPUTPCTrackParam
- GPUTPCTracklet
- GPUTPCTrack
- Writing ACAT Proceedings
- Next Steps:
- Make better use of SoA to improve performance
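For readers less familiar with the term, the difference between an array-of-structures (AoS) and a structure-of-arrays (SoA) layout can be sketched as follows. This is a generic illustration, not the NGT SoA code: the class names and the fields `x`, `y`, `z` are hypothetical and do not reflect the actual members of the O2 classes listed above.

```python
from array import array
from dataclasses import dataclass, field


@dataclass
class TrackAoS:
    # AoS: one object per track, the fields of a single track sit together.
    x: float
    y: float
    z: float


@dataclass
class TracksSoA:
    # SoA: one contiguous float array per field, so the same field of
    # consecutive tracks is adjacent in memory.
    x: array = field(default_factory=lambda: array("f"))
    y: array = field(default_factory=lambda: array("f"))
    z: array = field(default_factory=lambda: array("f"))

    def append(self, x, y, z):
        self.x.append(x)
        self.y.append(y)
        self.z.append(z)

    def __len__(self):
        return len(self.x)


def to_soa(tracks):
    """Convert a list of TrackAoS objects into a single TracksSoA container."""
    soa = TracksSoA()
    for trk in tracks:
        soa.append(trk.x, trk.y, trk.z)
    return soa


aos = [TrackAoS(float(i), 2.0 * i, 3.0 * i) for i in range(4)]
soa = to_soa(aos)

# A loop over one field touches only that field's array in the SoA layout.
sum_x_aos = sum(trk.x for trk in aos)
sum_x_soa = sum(soa.x)
print(sum_x_aos, sum_x_soa)
```

On a GPU, the SoA layout lets neighbouring threads read neighbouring elements of the same field, which is the usual motivation for this kind of change.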
-
10:40 → 10:45
TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
Speaker: Felix Weiglhofer (CERN)
-
10:45 → 10:50
ITS Tracking 5m
Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
Gabriele
Tested new parallel algorithm on CPU, similar to current implementation but with a parallel approach:
- For each track, find z component of DCA to beamline
- Sort tracks per z value
- For each track, count how many tracks fall into a window centered in z (and are time compatible) --> density
- For each track, check if it is a local maximum (look at neighbours)
- For each local maximum, fit the tracks that fall into the window
- For each seed, check if it is not a duplicate with neighboring seeds
- Surviving seeds promoted to vertices
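The steps above can be sketched in a few lines of Python. This is only a serial illustration of the idea, not the ITS tracker code: the `Track` fields, all cut values, the tie-breaking rule, and the use of a plain mean in place of a real vertex fit are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class Track:
    z: float  # z of the DCA to the beamline (step 1, assumed precomputed)
    t: float  # track time, used for the time-compatibility cut


def find_vertices(tracks, half_window=0.5, max_dt=1.0, min_density=3, min_sep=0.5):
    """Return (z, t, n_contributors) tuples; all cut values are placeholders."""
    trk = sorted(tracks, key=lambda tr: tr.z)  # step 2: sort by z
    zs = [tr.z for tr in trk]

    def window(i):
        # Indices of tracks inside the z window around track i that are also
        # time compatible with it. Binary search works because zs is sorted.
        lo = bisect_left(zs, zs[i] - half_window)
        hi = bisect_right(zs, zs[i] + half_window)
        return [j for j in range(lo, hi) if abs(trk[j].t - trk[i].t) <= max_dt]

    # Step 3: density per track. Each entry is independent of the others,
    # which is what would make this loop easy to parallelise.
    windows = [window(i) for i in range(len(trk))]
    density = [len(w) for w in windows]

    def is_local_max(i):
        # Step 4: no neighbour in the window has a higher density;
        # ties are resolved in favour of the lower index.
        if density[i] < min_density:
            return False
        return all((density[j], -j) <= (density[i], -i) for j in windows[i])

    # Step 5: "fit" each local maximum; the mean is a stand-in for a real fit.
    seeds = []
    for i in range(len(trk)):
        if is_local_max(i):
            w = windows[i]
            seeds.append((fmean([trk[j].z for j in w]),
                          fmean([trk[j].t for j in w]), len(w)))

    # Steps 6-7: drop a seed that duplicates the previously kept one (close in
    # z and time), keeping whichever has more contributors.
    vertices = []
    for z, t, n in seeds:
        if vertices:
            pz, pt, pn = vertices[-1]
            if abs(z - pz) < min_sep and abs(t - pt) <= max_dt:
                if n > pn:
                    vertices[-1] = (z, t, n)
                continue
        vertices.append((z, t, n))
    return vertices
```

Calling `find_vertices` on a list of `Track` objects returns one tuple per surviving seed. Steps 3 to 5 work on each track independently once the tracks are sorted, which is the property a multi-threaded or GPU version would exploit.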
Current version:
[1727121:its-tracker]: [17:12:06][INFO] - Vertex finding: found 132 vertices (total 132) in 50.63 ms
Parallel version (20 threads):
[1725878:its-tracker]: [17:05:17][INFO] - Vertex finding: found 132 vertices (total 132) in 14.35 ms
Preliminary results from 4 simulated Pb-Pb 50kHz TF:


Next steps
- Validate with more TFs
- Optionally tune parameters
- Implement the GPU version
-
10:50 → 10:55
System Run Coordination Topics 5m
Speaker: Ernst Hellbar (CERN)