Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- Test with GPU GRID jobs at NERSC pending.
- Will tune existing 16-core settings, add a SITEARCH for 16-core CPU, and for 16-core CPU + generic NVIDIA / AMD GPU, as for 8-core.
- Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
GPU ROCm / compiler topics:
- Problem with building ONNXRuntime with MIGraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Tested ROCm 7.2 on MI50 / 100 / 210. Running stably on 50 / 210, not checked for correctness yet. Crashes randomly on MI100, but this seems to be a different pattern than the serialization bug we had before.
- Need to understand and fix crash on RTX Pro 6000.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Final solution: merging transformation maps on the fly into a single flat object:
- Maps now yielding correct results, but 1.5x performance regression running on GPUs. Must be investigated.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- TODO: Workaround for wrong field used for encoding online, make memory scaling factors configurable via ConfigurableParam
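For reference on the alignas() item above, a minimal sketch of what forcing the alignment of a label type looks like. The struct names, fields and the alignUp helper are hypothetical stand-ins for illustration, not the actual O2 Monte Carlo label class, and the 8-byte value is an assumption, not taken from the ticket.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a Monte Carlo label (NOT the real O2 type).
// Its natural alignment follows the largest member, here 4 bytes on common ABIs.
struct PlainLabel {
  std::uint32_t trackId;
  std::uint16_t eventId;
  std::uint16_t sourceId;
};

// Same payload, but alignas() forces 8-byte alignment (assumed value), so
// host and device code agree on where such an object may start in a buffer.
struct alignas(8) AlignedLabel {
  std::uint32_t trackId;
  std::uint16_t eventId;
  std::uint16_t sourceId;
};

static_assert(sizeof(PlainLabel) == 8, "payload is 8 bytes");
static_assert(sizeof(AlignedLabel) == 8, "alignas(8) adds no padding here");
static_assert(alignof(AlignedLabel) == 8, "alignment is forced to 8 bytes");

// Hypothetical helper: round a byte offset up to the next multiple of
// 'alignment', e.g. when placing labels into a flat buffer by hand.
constexpr std::size_t alignUp(std::size_t offset, std::size_t alignment) {
  return (offset + alignment - 1) / alignment * alignment;
}
```

alignas() only changes the alignment requirement of the type. Buffers that are filled manually still need their offsets rounded up (as in alignUp), which may be why a different solution is preferable.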
Other topics:
- Need to bump ONNXRuntime to 1.24, Giulio is checking, needed for ROCm 7.2 - Status?
- Status of bumping CMake and boost (https://github.com/alisw/alidist/pull/6135):
- required to adapt / bump ~30 packages, now nearly done.
- Remaining issues: one problem in O2 (only on Mac) and one in O2Physics (wrong boost usage), PRs with fixes open.
- Need new DD tag, PR open.
- Problem with new libwebsocket on RHEL7 due to bogus kernel headers in that version colliding with glibc. Must either switch AliRoot CI to SLC9 and drop slc7 support, or disable IPv6 for slc7.
EPN GPU Topics: