Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- Test with GPU GRID jobs at NERSC pending.
- Updated default builds to include A100 GPU architecture, and 75-virtual as lowest compute capability for CUDA JIT compilation, so we should support all CUDA devices from 7.5 onwards now.
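The 7.5-onwards statement above can be sketched as a small check. This is only an illustration in plain C++, not code from O2: the helper names `encodeComputeCapability` and `canJitCompileFor` and the constant `kLowestVirtualArch` are hypothetical. It assumes that PTX embedded for a virtual architecture can be JIT-compiled for any device whose compute capability is at least that of the PTX.

```cpp
// Hypothetical helper (not part of O2): encodes a CUDA compute capability
// major.minor as one integer, e.g. 7.5 -> 75, 8.0 -> 80.
constexpr int encodeComputeCapability(int major, int minor)
{
  return major * 10 + minor;
}

// Lowest virtual architecture embedded in the default build, taken from the
// note above (75-virtual). The constant name is an assumption.
constexpr int kLowestVirtualArch = 75;

// Returns true if a device with the given compute capability is at or above
// the lowest embedded virtual architecture, i.e. the JIT path should apply.
constexpr bool canJitCompileFor(int major, int minor)
{
  return encodeComputeCapability(major, minor) >= kLowestVirtualArch;
}
```

In a real check, the major and minor values would come from the CUDA runtime (the `major` and `minor` fields of `cudaDeviceProp`); they are passed in by hand here so the sketch builds without CUDA.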
GPU ROCm / compiler topics:
- Problem with building ONNXRuntime with MigraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
- Serialization bug pending.
- Miscompilation on MI100 leading to memory error pending.
- New miscompilation on MI50 with ROCm 7.0 when RTC disabled.
- New miscompilation on MI50 on ROCm 6.3 and 7.0 when RTC enabled, with latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
- Waiting for ROCm 7.2, which could fix the MI100 serialization issue for good. Not clear yet whether it also addresses the miscompilation problems.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
- Final solution: merging transformation maps on the fly into a single flat object:
- Rebased current PR, CI green now, and gives same results on GPU as on CPU. But the results seem wrong, finding 10% fewer tracks than without the PR.
- Sergey provided a fix for time to z inverse conversion, but still not fully working, now finding 2% fewer tracks than without the PR.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
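For the alignas() item above (Monte Carlo label alignment), a minimal sketch of what the keyword does. The struct layout and the 16-byte value are assumptions chosen for illustration only; the actual O2 label type and the alignment needed for O2-5314 may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a Monte Carlo label; NOT the real O2 type.
// alignas(16) forces every instance, including array elements, onto a
// 16-byte boundary and pads sizeof() up to a multiple of 16.
struct alignas(16) AlignedLabel {
  std::uint32_t trackID;
  std::uint16_t eventID;
  std::uint16_t sourceID;
};

static_assert(alignof(AlignedLabel) == 16, "alignment requirement not applied");
static_assert(sizeof(AlignedLabel) % 16 == 0, "size must be a multiple of the alignment");

// Returns true if ptr satisfies the alignment requirement of T.
template <typename T>
bool isAligned(const void* ptr)
{
  return reinterpret_cast<std::uintptr_t>(ptr) % alignof(T) == 0;
}
```

The trade-off is padding: the 8 bytes of payload occupy 16 bytes per label in this sketch, which is one reason a different solution might be preferable.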
Other topics:
- GPU CI Server: Cannot power the GPUs right now, because the voltages / pinout of the connector on the mainboard for the PCIe cable do not match the cables we have (even though the physical connector is the same, and both are HP cables). This is basically creating a short. Thank you HP...
- Need to find other cables, or build our own cable.
EPN GPU Topics:
- AMD cannot deliver MI210 or newer samples, but Volker has some spare MI210 in Frankfurt, which he can send.
- To be inserted into the EPN farm, together with 1 MI50 and 1 MI100 as second dev-server with EPN setup. (https://its.cern.ch/jira/browse/EPN-572)
- MI210 GPU on the way to CERN, Ivan should bring it at the end of this week.