Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
CHEP Abstracts: https://indico.cern.ch/event/1471803/abstracts/ Deadline Dec. 19.
- Please upload to https://docs.google.com/document/d/1eek6kv_SqHE6b5k0KHs-6wcjHIxCKyQr6eldnhsptvY/edit?tab=t.0
Sync reconstruction
- Crash in TPC ZS decoding when receiving bad data, should check and improve protection such that we do not crash.
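A minimal sketch of the kind of protection meant above: validate every length field against the received buffer size before reading, and reject the page instead of crashing. The page layout and the name `decodePageSketch` are hypothetical and for illustration only; this is not the real TPC ZS format or decoder.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical page layout (NOT the real TPC ZS format):
// bytes 0..1 hold a little-endian payload length, followed by the payload.
std::optional<std::vector<std::uint8_t>> decodePageSketch(const std::uint8_t* data, std::size_t size)
{
  constexpr std::size_t headerSize = 2;
  if (data == nullptr || size < headerSize) {
    return std::nullopt; // truncated header: reject instead of reading out of bounds
  }
  const std::size_t payloadSize =
    static_cast<std::size_t>(data[0]) | (static_cast<std::size_t>(data[1]) << 8);
  if (payloadSize > size - headerSize) {
    return std::nullopt; // declared length exceeds the received buffer: corrupt page
  }
  // Both checks passed, so the copy below stays inside the buffer.
  return std::vector<std::uint8_t>(data + headerSize, data + headerSize + payloadSize);
}
```

The caller can then count and skip rejected pages rather than aborting the whole workflow.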
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- New GPU architecture selection for async in O2DPG looks good, should be merged.
- Test with GPU GRID jobs at NERSC pending.
- Asked DPG to run first test with ITS tracking on GPU on EPNs.
GPU ROCm / compiler topics:
- Issues that disappeared but are not yet understood: random server reboot with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when DMA engine turned off, MI100 stalling with ROCm 5.5.
- Problem with building ONNXRuntime with MIGraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
- Serialization bug pending.
- Miscompilation on MI100 leading to memory error pending.
- New miscompilation on MI50 with ROCm 7.0 when RTC disabled.
- New miscompilation on MI50 on ROCm 6.3 and 7.0 when RTC enabled, with latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
- AMD is changing their support structure: we should file reports via GitHub (which I like, since it is also easier to trace). However, they will no longer assign an engineer to follow up on all our issues; instead, issues have to go through their normal support process. In particular, this means their compiler team might not be able to fix issues if we do not provide a minimal reproducer.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
- Final solution: merging transformation maps on the fly into a single flat object:
- Sergey opened a new PR with the fixes and compatibility layer in, currently fails in the CI. Must be fixed, then Matthias can continue commissioning.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- New safer, dynamic cluster protection working and deployed at P2.
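For the alignas() item above, a minimal sketch of what forcing the alignment of a label type could look like. The struct `McLabelSketch`, its fields, and the 8-byte alignment are assumptions for illustration; they are not the actual O2 label type or the requirement discussed in O2-5314.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a Monte Carlo label (NOT the actual O2 type).
// alignas(8) forces 8-byte alignment even though the widest member is 4 bytes.
struct alignas(8) McLabelSketch {
  std::uint32_t trackId;
  std::uint16_t eventId;
  std::uint16_t sourceId;
};

// Compile-time checks: catch layout changes that would break the assumption.
static_assert(alignof(McLabelSketch) == 8, "label must be 8-byte aligned");
static_assert(sizeof(McLabelSketch) % alignof(McLabelSketch) == 0,
              "array elements must stay aligned");

// Runtime check for raw buffers (e.g. flat memory) before they are
// reinterpreted as labels.
inline bool isAlignedForLabel(const void* ptr)
{
  return reinterpret_cast<std::uintptr_t>(ptr) % alignof(McLabelSketch) == 0;
}
```

Note that alignas() only fixes the type's own alignment; buffers that place labels at arbitrary byte offsets still need padding or a check like the one above.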
Other topics:
- GRID Memory monitoring: Discussed with Maksim, the problem with incorrect values from smaps is already fixed, since they switched to cgroup monitoring.
EPN GPU Topics: