Alice Weekly Meeting: Software for Hardware Accelerators
10:00 → 10:20 Discussion (20 min). Speaker: David Rohr (CERN)
Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
- Test with GPU GRID jobs at NERSC pending.
- Will tune the existing 16-core settings, and add SITEARCH entries for 16-core CPU and for 16-core CPU + generic NVIDIA / AMD GPU, as was done for 8 cores.
- Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
GPU ROCm / compiler topics:
- Problem with building ONNXRuntime with MIGraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
- Serialization bug pending.
- Miscompilation on MI100 leading to a memory error: pending.
- New miscompilation on MI50 with ROCm 7.0 when RTC is disabled.
- New miscompilation on MI50 on ROCm 6.3 and 7.0 when RTC is enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
- Waiting for ROCm 7.2, which could fix the MI100 serialization issue for good. Not clear yet with regard to the miscompilation problems.
- ROCm 7.2 released, waiting for EPN to set it up with new dev2 server.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Final solution: merging transformation maps on the fly into a single flat object:
- Sergey is checking.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
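The alignas() idea from the Monte Carlo label item above could look roughly like this (a minimal sketch with invented field names, not the actual O2 label type):

```
#include <cstdint>

// Sketch only: invented stand-in for a packed MC-label type whose natural
// alignment is too weak for the consuming code. alignas() raises it explicitly.
struct alignas(8) MCLabelPacked {
  uint32_t eventID;   // 4 bytes
  uint16_t trackID;   // 2 bytes
  uint16_t sourceID;  // 2 bytes
};

// The compiler now guarantees 8-byte alignment, and sizeof() is a multiple
// of the alignment, so arrays of labels stay aligned as well.
static_assert(alignof(MCLabelPacked) == 8, "labels must be 8-byte aligned");
static_assert(sizeof(MCLabelPacked) % alignof(MCLabelPacked) == 0, "array elements stay aligned");
```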
Other topics:
- Molex connectors for GPU CI server arrived, still waiting for cables.
- Removed the bogus setting of the ROCm path in LD_LIBRARY_PATH by the ONNXRuntime recipe. Seems to have no side effects.
EPN GPU Topics:
- Status of power cable problem for dev2 server and MI210?
10:20 → 10:25 TPC ML Clustering (5 min). Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
- Fixed problem of cuda and rocm library detection: https://github.com/alisw/alidist/pull/6093
- Highlighted a problem with GPU architecture recognition (PR by David)
- Highlighted a problem with the aliBuild virtual py-env not being used for building (PR by Giulio)
- Currently working on dumping the cluster errors in the right place (work ongoing again now, after some days of vacation)
10:25 → 10:30 GPU Parameter Optimizations (5 min). Speaker: Gabriele Cimador (CERN, Università and INFN Torino)
News from GPU parameter tuning
About L40S
Note: L40S uses the AMPERE default parameters.
For real Pb-Pb data:
- Sync mean time default: 2018.66 ms ± 167.23 ms
- Sync mean time optimised: 1300.35 ms ± 92.61 ms
- Performance gain with tuned params: 35.58%
For simulated Pb-Pb data:
- Sync mean time default: 1677.37 ms ± 2.52 ms
- Sync mean time optimised: 1031.51 ms ± 2.85 ms
- Performance gain with tuned params: 38.50%
New GPU parametrization
- PR is online [link]
- I will update with the logic of the new GPU detection once it is merged
- Possible to suppress clang for that json?
- MI100 bug?
10:30 → 10:35 Efficient Data Structures (5 min). Speaker: Dr Oliver Gregor Rietmann (CERN)
NextGenTrigger Task 1.7
- Implementing our SoA code in MadGraph, together with
- Stefan Roiser
- Daniele Massaro
- No news
Implement NGT SoA Code in O2 standalone benchmark
- Working on this fork of the AliceO2 repo, with a CI pipeline:
- Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
- Extended the CI pipeline to fail if GPU.out changes
- Implemented SoA in:
- SectorTracker
- GPUTPCBaseTrackParam
- GPUTPCTrackParam
- GPUTPCTracklet
- SectorTracker
- Maybe go back from CRTP to classical inheritance
- Better implementation of iterators to support
- std::sort
- thrust::sort
- our custom sort function
- Next Steps:
- Make better use of SoA to improve performance
- Try David's suggestion
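As a generic illustration of the AoS-to-SoA idea behind the items above (a sketch only, not the NGT or O2 implementation):

```
#include <cstddef>
#include <vector>

// Array-of-Structs: one track's fields are contiguous, but iterating a
// single field (e.g. x) strides over the whole struct.
struct TrackAoS { float x, y, z; };

// Struct-of-Arrays: each field lives in its own contiguous array, so GPU
// threads / SIMD lanes touching neighbouring tracks load coalesced memory.
struct TracksSoA {
  std::vector<float> x, y, z;
  void resize(std::size_t n) { x.resize(n); y.resize(n); z.resize(n); }
  std::size_t size() const { return x.size(); }
};

// Summing x over all tracks reads only the x array in the SoA layout,
// instead of pulling y and z through the cache as the AoS layout would.
float sumX(const TracksSoA& t) {
  float s = 0.f;
  for (std::size_t i = 0; i < t.size(); ++i) s += t.x[i];
  return s;
}
```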
10:35 → 10:40 Following up GPU to-dos (5 min). Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
10:40 → 10:45 TPC Clusterization / OpenCL / Highly Ionizing Particles (5 min). Speaker: Felix Weiglhofer (CERN)
OpenCL
No news.
GPU Servers
Dev machine: no news.
Highly Ionizing Particles
Quick followup to last week:
Finished vectorizing kernel (GPU threads load 4-8 charges at once).
Performance barely changes...
Also, the old kernel assumed the wrong memory layout (pad-major instead of time-major).
The pad filter kernel parallelises across pads, so switching to the time-major layout and vectorizing memory access also reduces the number of threads by the same amount...
Fixed parallelization across rows:
- Fixed memory access to time-major layout
- Fixed bank conflicts when accessing shared memory
This increases throughput from 80 GB/s to 128 GB/s, i.e. basically identical performance to the current version (130 GB/s).
TODO: Needs testing on Nvidia.
The pad filter and cluster finder currently assume 64-byte cache lines; Nvidia moved to 128-byte cache lines a couple of GPU generations ago.
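The layout switch described above can be illustrated with hypothetical index helpers (invented names, not the actual kernel code):

```
#include <cstddef>

// Pad-major: all time bins of one pad are contiguous. Threads that each
// handle a different pad at the same time bin then access strided memory.
inline std::size_t idxPadMajor(std::size_t pad, std::size_t time, std::size_t nTime) {
  return pad * nTime + time;
}

// Time-major: all pads of one time bin are contiguous, so per-pad threads
// (and 4-8 wide vector loads) hit consecutive addresses, i.e. coalesced.
inline std::size_t idxTimeMajor(std::size_t pad, std::size_t time, std::size_t nPads) {
  return time * nPads + pad;
}
```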
10:45 → 10:50 ITS Tracking (5 min). Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
Felix:
Found a TF which cannot be processed on GPU but survives the CPU code (peak RSS ~15 GB).
Implemented more memory clearing in https://github.com/AliceO2Group/AliceO2/pull/14999; now the same TF survives on GPU (peak allocation ~18 GB).
Note though that this comes at the price of more device<->host<->device data movement.
This should take care of the bulk of the (smallish) difference. Remaining: verify with deterministic mode (and maybe check the impact on timing, although recovery of the TF is more important).
To bring back a thing of the past: in the CI and also in local compilations on the EPNs, I always observe errors of this kind:
```
In file included from /opt/rocm-6.3.2/lib/llvm/lib/clang/18/include/cuda_wrappers/bits/basic_string.h:7:
In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/basic_string.h:40:
In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/ext/alloc_traits.h:34:
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/alloc_traits.h:272:8: warning: 'destroy<o2::its::TrackITSExt>' is deprecated: use 'allocator_traits::destroy' instead [-Wdeprecated-declarations]
  272 |       { __a.destroy(__p); }
      |         ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/alloc_traits.h:378:4: note: in instantiation of function template specialization 'std::allocator_traits<std::pmr::polymorphic_allocator<o2::its::TrackITSExt>>::_S_destroy<std::pmr::polymorphic_allocator<o2::its::TrackITSExt>, o2::its::TrackITSExt>' requested here
```
Notice that I see something like c++/11/..., from which I assume that this is pulling in C++11?
This is of course just a deprecation warning, so not actually critical (for now...). But David and I had a chat about this a while back and hit the following problem:
Almost pseudo CMake code:
```
set(CMAKE_HIP_HOST_COMPILER "$ENV{GCC_TOOLCHAIN_ROOT}/bin/gcc")
message(STATUS "SETTING CMAKE_HIP_HOST_COMPILER to ${CMAKE_HIP_HOST_COMPILER}")
enable_language(HIP)
message(STATUS "HIP language enabled: ${CMAKE_HIP_COMPILER}")
message(STATUS "COMPILER to ${CMAKE_HIP_HOST_COMPILER}")
```
Output
```
-- SETTING CMAKE_HIP_HOST_COMPILER to /home/fschlepp/alice/sw/slc9_x86-64/GCC-Toolchain/v14.2.0-alice2-2/bin/gcc
-- HIP language enabled: /opt/rocm/llvm/bin/clang++
-- COMPILER to
```
Somehow, after enable_language, the host compiler is reset (to what? probably the system compiler). Would be nice to repair this.
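A possible workaround (sketch only, based on general CMake cache-variable behaviour; not tested in the O2 build): enable_language() re-runs compiler detection and can overwrite plain variables, so pinning the value as a cache entry, or passing -DCMAKE_HIP_HOST_COMPILER=... on the command line or in a toolchain file, may survive the reset:

```
# Sketch: pin the HIP host compiler as a cache entry before compiler
# detection, so enable_language(HIP) cannot silently reset a plain variable.
set(CMAKE_HIP_HOST_COMPILER "$ENV{GCC_TOOLCHAIN_ROOT}/bin/gcc"
    CACHE FILEPATH "Host compiler for the HIP clang driver" FORCE)
enable_language(HIP)
message(STATUS "HIP host compiler: ${CMAKE_HIP_HOST_COMPILER}")
```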
10:50 → 10:55 System Run Coordination Topics (5 min). Speaker: Ernst Hellbar (CERN)