Alice Weekly Meeting: Software for Hardware Accelerators
10:00 AM → 10:20 AM
Discussion (20m)
Speaker: David Rohr (CERN)
Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- The limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias; need to check if it works.
- Test with GPU GRID jobs at NERSC pending.
- New builds with GENERIC_NVIDIA / GENERIC_AMD Grid site architecture support are working. The virtual sm_75 architecture also works generically, so we can run on the RTX Pro 6000 in alibicompute with that software.
- Problem: /opt/rocm and the NVIDIA CUDA runtime provided by the apptainer NVIDIA runtime are not present in LD_LIBRARY_PATH. Maksim is checking how to fix it. I think this should come from the system level, not from O2.
- Will tune the existing 16-core settings and add a SITEARCH for 16-core CPU and for 16-core CPU + generic NVIDIA / AMD GPU, as for 8 cores.
- Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
GPU ROCm / compiler topics:
- Problem with building ONNXRuntime with MIGraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
- Serialization bug pending.
- Miscompilation on MI 100 leading to memory error pending.
- New miscompilation on MI 50 with ROCm 7.0 when RTC disabled.
- New miscompilation on MI 50 on ROCm 6.3 and 7.0 when RTC enabled, with latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
- Waiting for ROCm 7.2, which could fix the MI100 serialization issue for good. Not clear yet with regard to the miscompilation problems.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Final solution: merging transformation maps on the fly into a single flat object:
- Matthias checked the latest version and can reproduce the issues I reported; he is checking with Sergey. Could be related to extrapolation.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
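The alignas() option from the Monte Carlo label item in the list above can be illustrated with a minimal sketch. The PackedLabel struct, its fields, and the isAligned helper are hypothetical and are not the O2 label type; the 8-byte alignment is an assumed value chosen only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a packed Monte Carlo label (NOT the O2 type).
// alignas(8) requests 8-byte alignment (an assumed value) for every instance,
// including elements of arrays of this type.
struct alignas(8) PackedLabel {
  std::uint32_t trackAndEvent;
  std::uint16_t sourceID;
  std::uint16_t flags;
};

// Compile-time checks: the requested alignment is applied, and the size is a
// multiple of the alignment so that array elements stay aligned.
static_assert(alignof(PackedLabel) == 8, "unexpected alignment");
static_assert(sizeof(PackedLabel) % alignof(PackedLabel) == 0, "size must be a multiple of the alignment");

// Runtime check (hypothetical helper): true if ptr satisfies the alignment of T.
template <typename T>
bool isAligned(const void* ptr)
{
  return reinterpret_cast<std::uintptr_t>(ptr) % alignof(T) == 0;
}
```

A check like isAligned<PackedLabel>(ptr) could be used where labels are placed into a raw buffer, since alignas() on the type does not by itself align a pointer obtained by offsetting into a byte array.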
Other topics:
- Ordered connectors and cables for the GPU power supply in the CI server, to build adapter cables manually.
- ONNXRuntime was adding /opt/rocm to LD_LIBRARY_PATH. Don't understand why, but we should do this at system level, not at alidist level. In any case, O2 had it only accidentally. Will remove it and see if something breaks; if so, it should be fixed at system level.
EPN GPU Topics:
- MI210 arrived at CERN? Status of dev2 server?
10:20 AM → 10:25 AM
10:25 AM → 10:30 AM
GPU Parameter Optimizations (5m)
Speaker: Gabriele Cimador (CERN, Università and INFN Torino)
News from GPU parameter tuning
- Benchmark executed 8 times + 2 warmup
- Tuned parameters
- With NUMA binding
- With double pipeline
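The measurement scheme listed above (2 warmup runs followed by 8 measured runs, then averaged) can be sketched as follows. This is only an illustration under assumptions: timeRuns and meanAfterWarmup are hypothetical names, the workload is an arbitrary callable, and wall-clock time stands in for whatever timer the actual benchmark uses.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Mean of the measurements after discarding the first nWarmup entries.
inline double meanAfterWarmup(const std::vector<double>& seconds, std::size_t nWarmup)
{
  if (seconds.size() <= nWarmup) {
    throw std::invalid_argument("need at least one measured run after warmup");
  }
  const auto first = seconds.begin() + static_cast<std::ptrdiff_t>(nWarmup);
  const double sum = std::accumulate(first, seconds.end(), 0.0);
  return sum / static_cast<double>(seconds.size() - nWarmup);
}

// Runs `work` nWarmup + nMeasured times and returns the wall time of every
// run in seconds (warmup runs first).
template <typename Work>
std::vector<double> timeRuns(Work&& work, std::size_t nWarmup, std::size_t nMeasured)
{
  std::vector<double> seconds;
  seconds.reserve(nWarmup + nMeasured);
  for (std::size_t i = 0; i < nWarmup + nMeasured; ++i) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    seconds.push_back(elapsed.count());
  }
  return seconds;
}
```

With the numbers from the list, a call would look like meanAfterWarmup(timeRuns(processEvent, 2, 8), 2), where processEvent is a placeholder for the benchmarked workload.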
Real PbPb results
Each time is the average time per event in seconds
GPU                     Event 1   Event 2   Event 3
AMD MI300X              1.07 s    1.19 s    1.21 s
AMD Radeon Pro W7900    2.08 s    2.38 s    2.41 s
Nvidia H100 NVL 94GB    0.95 s    1.07 s    1.07 s
Sim PbPb results
GPU                     Time per event
AMD Radeon Pro W7900    1.87 s
Nvidia H100 NVL 94GB    0.86 s
- Will update the table every week
- Need to check with deterministic mode
New GPU parametrization
Clang-format wants to change from this:
"GPUTPCNeighboursFinder": {"default": 256,"MI100": [192, 8],"VEGA": [960, 8],"AMPERE": [640, 1],"TURING": [640, 1]},
To this:
"GPUTPCNeighboursFinder": {"default": 256,"MI100": [192,8],"VEGA": [960,8],"AMPERE": [640,1],"TURING": [640,1]},
Is there a way to suppress clang-format for that specific file?
10:30 AM → 10:35 AM
Efficient Data Structures (5m)
Speaker: Dr Oliver Gregor Rietmann (CERN)
NextGenTrigger Task 1.7
- Implementing our SoA code in MadGraph, together with
- Stefan Roiser
- Daniele Massaro
- No news
Implement NGT SoA Code in O2 standalone benchmark
- Working on this fork of the AliceO2 repo, with a CI pipeline:
- Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
- Extended the CI pipeline to fail if GPU.out changes
- Implemented SoA in:
- SectorTracker
- GPUTPCBaseTrackParam
- GPUTPCTrackParam
- GPUTPCTracklet
- Maybe go back from CRTP to classical inheritance
- Better implementation of iterators
- Next Steps:
- Make better use of SoA to improve performance
- Try David's suggestion
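For readers unfamiliar with the SoA idea behind the items above, the difference between an array-of-structures and a structure-of-arrays layout can be sketched as below. All names (TrackParamAoS, TrackParamSoA, toSoA, sumX) and the three-field layout are hypothetical; this is not the NGT SoA library and not the actual GPUTPCTrackParam, only a minimal illustration of the layout change.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures: one object per track, fields interleaved in memory.
struct TrackParamAoS {
  float x;
  float y;
  float z;
};

// Structure-of-arrays: one contiguous array per field, so a loop over a
// single field touches only that field's memory.
struct TrackParamSoA {
  std::vector<float> x;
  std::vector<float> y;
  std::vector<float> z;

  explicit TrackParamSoA(std::size_t n) : x(n), y(n), z(n) {}
  std::size_t size() const { return x.size(); }
};

// Copies an AoS container field by field into the SoA layout.
inline TrackParamSoA toSoA(const std::vector<TrackParamAoS>& aos)
{
  TrackParamSoA soa(aos.size());
  for (std::size_t i = 0; i < aos.size(); ++i) {
    soa.x[i] = aos[i].x;
    soa.y[i] = aos[i].y;
    soa.z[i] = aos[i].z;
  }
  return soa;
}

// Example access pattern that reads only the x field.
inline float sumX(const TrackParamSoA& tracks)
{
  float sum = 0.f;
  for (float v : tracks.x) {
    sum += v;
  }
  return sum;
}
```

In sumX the SoA layout reads consecutive floats, whereas the same loop over TrackParamAoS would stride over the unused y and z fields; whether this helps in practice depends on the kernels' real access patterns.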
10:35 AM → 10:40 AM
10:40 AM → 10:45 AM
TPC Clusterization / OpenCL / Highly Ionizing Particles (5m)
Speaker: Felix Weiglhofer (CERN)
OpenCL
No news.
GPU Servers
Dev machine: running at P2. Currently a bare-bones setup with CUDA and ROCm.
Slurm integration: Giada is working on it; there are some compatibility issues from mixing Alma 9 and 10 (delayed because of network issues at P2).
Managed to compile O2 and standalone benchmark.
Highly Ionizing Particles
Finished vectorizing kernel (GPU threads load 4-8 charges at once).
Performance barely changes...
Also, the old kernel assumed the wrong memory layout (pad-major instead of time-major).
This resulted in warps accessing two far-apart cache lines. Fixing this also didn't change performance...
Next steps: try micro-benchmarks to find out how to improve the access pattern?
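The layout mix-up described above can be made concrete with a small index sketch. It assumes a plain 2D charge array and uses "time-major" to mean that the time bin is the slow index (pads of one time bin are contiguous) and "pad-major" to mean the opposite; ChargeMapShape and both index functions are hypothetical names, and the real charge map may be tiled or padded differently.

```cpp
#include <cstddef>

// Hypothetical shape of a 2D charge array: nPads pads and nTime time bins.
struct ChargeMapShape {
  std::size_t nPads;
  std::size_t nTime;
};

// Time-major (assumed meaning): time bin is the slow index, so all pads of
// one time bin are contiguous in memory.
constexpr std::size_t indexTimeMajor(ChargeMapShape s, std::size_t pad, std::size_t time)
{
  return time * s.nPads + pad;
}

// Pad-major (assumed meaning): pad is the slow index, so all time bins of
// one pad are contiguous in memory.
constexpr std::size_t indexPadMajor(ChargeMapShape s, std::size_t pad, std::size_t time)
{
  return pad * s.nTime + time;
}

constexpr ChargeMapShape kExample{4, 8};

// Stepping one time bin at a fixed pad moves by nPads elements in the
// time-major layout, but by a single element in the pad-major layout.
static_assert(indexTimeMajor(kExample, 0, 1) - indexTimeMajor(kExample, 0, 0) == 4, "time step, time-major");
static_assert(indexPadMajor(kExample, 0, 1) - indexPadMajor(kExample, 0, 0) == 1, "time step, pad-major");
```

A kernel that computes addresses with one formula while the data is stored with the other reads elements that are nPads or nTime apart instead of adjacent, which is one way neighbouring threads can end up in different cache lines.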
10:45 AM → 10:50 AM
ITS Tracking (5m)
Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
Felix: At a conference this week. As David suggested, found a TF that is processed on the CPU with a peak RSS of ~15 GB but fails on the GPU. Identified where it goes OOM and also saw other opportunities to save memory (at the cost of clearing memory more often). Will fix soon (~next week), then ask for another round of tests.
Gabriele: Results are skewed by a bug in the ITS trackleting without primary vertices. Will fix it with Felix next week. In the meantime, will check the quality of the vertex function with the tracklets computed by the old vertexer.
10:50 AM → 10:55 AM
System Run Coordination Topics (5m)
Speaker: Ernst Hellbar (CERN)