Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for pp workflow is now the TPC time series, which is to slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- Test with GPU GRID jobs at NERSC pending.
- Will tune existing 16-core settings, add a SITEARCH for 16core CPU, and 16coreCPU + generic NVIDIA / AMD GPU, like for 8 core.
- Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
GPU ROCm / compiler topics:
- Problem with building ONNXRuntime with MigraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
- Serialization bug pending.
- Miscompilation on MI 100 leading to memory error pending.
- New miscompilation on MI 50 with ROCm 7.0 when RTC disabled.
- New miscompilation on MI 50 on ROCm 6.3 and 7.0 when RTC enabled, with latest software. Have a workaround for Pb-Pb data taking, but not compatible to latest tracking developments.
- Waiting for ROCm 7.2, which could fix the MI100 serialization issue for good. Not clear yet with regards to miscompilation problems.
- ROCm 7.2 released, waiting for EPN to set it up with new dev2 server.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of monte carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Final solution: merging transformation maps on the fly into a single flat object:
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
Other topics:
- Molex connectors for GPU CI server arrived, still waiting for cables.
- Removed bogus setting of rocm path in LD_LIBRARY_PATH by ONNXRuntime recipe. Seems to have no side effects.
EPN GPU Topics:
- Status of power cable problem for dev2 server and MI210?