Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for pp workflow is now the TPC time series, which is to slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- Test with GPU GRID jobs at NERSC pending.
- Will tune existing 16-core settings, add a SITEARCH for 16core CPU, and 16coreCPU + generic NVIDIA / AMD GPU, like for 8 core.
- Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
GPU ROCm / compiler topics:
- Problem with building ONNXRuntime with MigraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
- Serialization bug pending.
- Miscompilation on MI 100 leading to memory error pending.
- New miscompilation on MI 50 with ROCm 7.0 when RTC disabled.
- New miscompilation on MI 50 on ROCm 6.3 and 7.0 when RTC enabled, with latest software. Have a workaround for Pb-Pb data taking, but not compatible to latest tracking developments.
- Waiting for ROCm 7.2, which could fix the MI100 serialization issue for good. Not clear yet with regards to miscompilation problems.
- Dev2 server available with MI210 and ROCm 7.2
- Runs stable on MI50 and MI210 GPUs, didn't check correctness yet.
- Serialization issue on MI100 seems fixed, or at least does no longer trigger immediately.
- Instead, having other random crashes on MI100 now.
- Seeing a crash on RTX6000 NVIDIA GPU with some data sets, must be either a bug in our code or on the NVIDIA side.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of monte carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Final solution: merging transformation maps on the fly into a single flat object:
- Sergey is checking. Differente in treatment outside of the measured region. New treatment implemented in the old maps, to check if that causes the differences.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
Other topics:
- Molex connectors and cables for GPU CI server arrived, need to assemble them.
- Removed bogus setting of rocm path in LD_LIBRARY_PATH by ONNXRuntime recipe. Seems to have no side effects.
- Build failures without clear error when using old aliBuild versions. Added a feature to query the version to aliBuild, will disable all GPU builds for old versions ins the future.
EPN GPU Topics: