Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
- Test with GPU GRID jobs at NERSC pending.
- Will tune existing 16-core settings, add a SITEARCH for 16-core CPU, and for 16-core CPU + generic NVIDIA / AMD GPU, as done for 8 cores.
- Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
- Updated GPU buffer sizes for 0.2 T low-field TPC processing; will move the parameters to a configurableParam so they can be changed without rebuilding.
- Problem in one run where online CCDB updates were not working and the TPC track model encoding used incorrect field settings. Will implement a workaround in the decoding: use the stored (incorrect) field for decoding, but the correct field for tracking.
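A minimal toy sketch (hypothetical names and values, not the actual compression code) of why the decoder must reuse the stored field: the track model stores clusters as residuals w.r.t. a field-dependent track extrapolation, so the residuals only reproduce the clusters when decoded with the same field that was used at encoding time.

```python
def extrapolate(track_pt, x, field):
    """Toy helix extrapolation: transverse deflection depends on the field."""
    return 0.3 * field * x * x / (2.0 * track_pt)

def encode(cluster_y, track_pt, x, field):
    # Track model compression stores the residual to the extrapolation.
    return cluster_y - extrapolate(track_pt, x, field)

def decode(residual, track_pt, x, field):
    return extrapolate(track_pt, x, field) + residual

B_STORED, B_CORRECT = 0.5, 0.2   # field used at encoding vs. true field (toy values)
cluster_y = 1.234
res = encode(cluster_y, 1.0, 100.0, B_STORED)

# Decoding with the stored (incorrect) field reproduces the cluster exactly:
assert abs(decode(res, 1.0, 100.0, B_STORED) - cluster_y) < 1e-12
# Decoding with the corrected field would not:
assert abs(decode(res, 1.0, 100.0, B_CORRECT) - cluster_y) > 1e-3
```

The subsequent tracking is then free to use the corrected field, since it only consumes the decoded cluster positions.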
GPU ROCm / compiler topics:
- Problem with building ONNXRuntime with MigraphX support.
- Need to find a way to build ONNXRuntime with support for both CUDA and ROCm.
- Try to find a better solution for the problem of __device__ inline functions leaking symbols into the host code.
- Tested ROCm 7.2 on MI50 / 100 / 210. Running stably on MI50 / MI210, not checked for correctness yet. Crashes randomly on MI100, but the pattern seems different from the serialization bug we had before.
- Need to understand and fix crash on RTX Pro 6000.
- New GPU Builder Container with CUDA 13.1.1 available, supporting GCC 15.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Final solution: merging transformation maps on the fly into a single flat object:
- Maps now yielding correct results, but there is a 1.5x performance regression when running on GPUs; must be investigated.
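The idea of the flat object can be sketched as follows (a hypothetical toy, not the O2 implementation): instead of chaining several transformation maps per lookup, their composition is sampled once onto a regular grid, so the hot path does a single interpolated table lookup.

```python
import math

def map_a(x):
    """Toy transformation map (e.g. an average correction)."""
    return x + 0.1 * math.sin(x)

def map_b(x):
    """Toy second map (e.g. a run-time scaling correction)."""
    return 1.01 * x

N, LO, HI = 100, 0.0, 10.0
STEP = (HI - LO) / (N - 1)

# "Merge on the fly": sample the composition map_b(map_a(x)) once into a flat table.
flat = [map_b(map_a(LO + i * STEP)) for i in range(N)]

def lookup(x):
    """Single linearly interpolated lookup in the merged flat table."""
    t = (x - LO) / STEP
    i = min(int(t), N - 2)
    f = t - i
    return flat[i] * (1 - f) + flat[i + 1] * f

assert abs(lookup(3.7) - map_b(map_a(3.7))) < 1e-3
```

The flat table trades a one-time merge cost and some memory for cheap, branch-free lookups, which is why a 1.5x GPU regression is unexpected and worth investigating.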
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
Other topics:
- Major improvement of GPU CMake and parameter loading:
- Instead of detecting individual architectures, now performing a >= comparison on compute capabilities; if no tuning for the exact architecture is available, take the parameters of the closest previous architecture for which we have tuned values.
- Support building O2 with a database of tuned parameters provided in CSV format.
- Support merging multiple database files in CSV and JSON format on the fly.
- Binary .par files for loading at RTC can be created from CSV and JSON files with a simple script.
- Sped up GPU CMake from ~2sec to ~0sec.
- Stopped hard-exporting ALIBUILD_O2_FORCE_GPU=1 in the GPU builder container; it is now set in the env files of CI jobs and in the defaults of Jenkins builders where a GPU is needed, and can e.g. be disabled for dataflow-defaults CI jobs / Jenkins builders.
- Everyone using the slc9-gpu-builder container or Jenkins to build with GPU support: note that ALIBUILD_O2_FORCE_GPU=1 must be exported to get the old behavior.
- Note that by default GPUs are autodetected, so all backends will be available, but if no GPU is detected builds target the fallback architectures (MI50 for AMD; sm_75-virtual for NVIDIA), not our default list of production architectures (including MI100, RTX Pro 6000, ...).
- Need to bump ONNXRuntime to 1.24 (needed for ROCm 7.2); Giulio is checking.
- Opened PRs to bump CMake to 4.2 (https://github.com/alisw/alidist/pull/6135) and Boost to 1.90 (https://github.com/alisw/alidist/pull/6134). Giulio will take care of bumping GCC to 15.2. Could also think about bumping arrow and clang.
EPN GPU Topics: