Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
CHEP Abstracts: https://indico.cern.ch/event/1471803/abstracts/ Deadline Dec. 19.
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- New GPU architecture selection for async in O2DPG looks good, should be merged.
- Test with GPU GRID jobs at NERSC pending.
- Asked DPG to run first test with ITS tracking on GPU on EPNs.
GPU ROCm / compiler topics:
- Issues that disappeared but are not yet understood: random server reboot with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when DMA engine turned off, MI100 stalling with ROCm 5.5.
- Problem with building ONNXRuntime with MIGraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
- Serialization bug pending, and with the O2 update it has become more likely to occur. Still hoping AMD will be able to implement a fix.
- Found a new regression in ROCm 7, which miscompiles on MI100 with latest O2 (independent of the serialization bug). Miscompilation does not happen on MI50 or on old O2.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
- Final solution: merging transformation maps on the fly into a single flat object:
- Sergey opened a new PR with the fixes and compatibility layer in, currently fails in the CI. Must be fixed, then Matthias can continue commissioning.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- PR merged, further developments in https://github.com/AliceO2Group/AliceO2/pull/14651, investigating some problems now, not clear if related:
- TPC vdrift and laser calibration issues.
- Crash with low-field data at end of run.
- Large memory usage in pp on the GRID.
- Ruben reported a drop of efficiency above 4 GeV.
- Next iteration of tracking improvements in draft PR: https://github.com/AliceO2Group/AliceO2/pull/14651
- gpu-reconstruction quitting with error in some async jobs due to running out of buffers: Problem was due to TPC Sector A11 (the broken one) having much fewer clusters. This, together with Ruben's settings for aggressive extrapolation to adjacent sectors, led to too many sector tracks in sector A11, and gpu-reconstruction aborted because the estimated memory size was exceeded. Fixed by taking adjacent sector occupancy into account for the buffer size estimation.
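Illustration of the buffer-size fix in the last item above: a minimal sketch, not the actual O2 code. All names (kSectorsPerSide, kClustersPerTrack, estimateOwn, estimateWithNeighbours) and the scaling constant are hypothetical, and using the maximum of the sector's own occupancy and that of its two neighbours is only one assumed way of "taking adjacent sector occupancy into account".

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sketch, not the O2 implementation.
constexpr std::size_t kSectorsPerSide = 18;    // sectors on one TPC side, treated as a ring
constexpr std::size_t kClustersPerTrack = 100; // assumed scaling: one track slot per 100 clusters

using ClusterCounts = std::array<std::size_t, kSectorsPerSide>;

// Estimate based only on the sector's own cluster count.
// This underestimates a sparse sector that receives tracks extrapolated from its neighbours.
std::size_t estimateOwn(const ClusterCounts& nClusters, std::size_t sector)
{
  assert(sector < kSectorsPerSide);
  return nClusters[sector] / kClustersPerTrack;
}

// Estimate based on the largest occupancy among the sector and its two neighbours,
// so that tracks extrapolated into a sparse sector still fit into its buffer.
std::size_t estimateWithNeighbours(const ClusterCounts& nClusters, std::size_t sector)
{
  assert(sector < kSectorsPerSide);
  const std::size_t left = nClusters[(sector + kSectorsPerSide - 1) % kSectorsPerSide];
  const std::size_t right = nClusters[(sector + 1) % kSectorsPerSide];
  const std::size_t effective = std::max({nClusters[sector], left, right});
  return effective / kClustersPerTrack;
}
```

With every sector at 100000 clusters except sector 11 at 1000, estimateOwn gives 10 track slots for sector 11, while estimateWithNeighbours gives 1000, matching what the neighbours can feed into it.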
EPN GPU Topics: