Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- Test with GPU GRID jobs at NERSC pending.
- Will tune existing 16-core settings, add a SITEARCH for 16-core CPU and one for 16-core CPU + generic NVIDIA / AMD GPU, as already done for 8 cores.
- Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
GPU ROCm / compiler topics:
- Problem with building ONNXRuntime with MIGraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Need to check ROCm 7.2 correctness.
- Need to understand and fix crash on RTX Pro 6000.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Final solution: merging transformation maps on the fly into a single flat object:
- Maps now yield correct results, but with a 1.5x performance regression when running on GPUs. Must be investigated.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- TODO: Workaround for wrong field used for encoding online, make memory scaling factors configurable via ConfigurableParam
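For reference on the alignas() item above, a minimal C++ sketch of what enforcing alignment on a label type could look like. The struct McLabelSketch, its fields, the 8-byte alignment, and the helpers isAligned / alignUp are all hypothetical illustrations, not the actual O2 label class or the fix tracked in O2-5314.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a Monte Carlo label (NOT the actual O2 class).
// alignas(8) requests 8-byte alignment for every object of this type,
// also when it is used as an array element or as a struct member.
struct alignas(8) McLabelSketch {
  std::uint32_t trackId;
  std::uint16_t eventId;
  std::uint16_t sourceId;
};

// Compile-time checks: the alignment is what we asked for, and the size is
// a multiple of it, so consecutive array elements stay aligned.
static_assert(alignof(McLabelSketch) == 8, "unexpected alignment");
static_assert(sizeof(McLabelSketch) % alignof(McLabelSketch) == 0,
              "size must be a multiple of the alignment");

// Returns true if ptr is a multiple of alignment (in bytes).
inline bool isAligned(const void* ptr, std::size_t alignment)
{
  return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// Rounds a byte offset up to the next multiple of alignment.
// alignment must be a power of two.
constexpr std::size_t alignUp(std::size_t offset, std::size_t alignment)
{
  return (offset + alignment - 1) & ~(alignment - 1);
}
```

alignas() only covers objects that the compiler or allocator places. When labels are packed by hand into a flat byte buffer, the offset of the label block still has to be rounded up explicitly (alignUp in the sketch), which may be one reason to look for a better solution than alignas() alone.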
Other topics:
- Need to bump ONNXRuntime to 1.24, Giulio is checking, needed for ROCm 7.2 - Status?
- Status of bumping CMake and boost (https://github.com/alisw/alidist/pull/6135):
- Remaining issues:
- libwebsockets kernel headers
- Ernst following up issues with ODC/DDS
- aarch uses old Python, trying to compile old xgboost, which is incompatible.
- One compilation problem on macOS with boost histogram.
EPN GPU Topics: