Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
High priority Framework issues:
- Start / Stop / Start: 2 problems on O2 side left:
-
- All processes are crashing randomly (usually ~2 out of >10k) when restarting. Stack trace hints to FMQ. https://its.cern.ch/jira/browse/O2-4639
- TPC ITS matching QC crashing accessing CCDB objects. Not clear if same problem as above, or a problem in the task itself:
- Stabilize calibration / fix EoS: New scheme: https://its.cern.ch/jira/browse/O2-4308:
- O2 with new EndOfStream deployed at P2. working so far, but didn't test in detail. Spotted also one minor problem that apparently if processes hit the stop-transition-timeout during STOP, they segfault during SHUTDOWN.
- Still need to test that it keeps running the calibration after the data processing timeout. RC will run set of tests with different time outs.
- PR with InfoLogger improvements still WIP.
- Fix problem with ccdb-populator: no idea yet - since Ole left, someone else will have to take care.
- TF-status message (from https://github.com/AliceO2Group/AliceO2/pull/13495) sent by readout-proxy. Status?
- ONNX Update Status?
Sync reconstruction
- Waiting for RC to test COSMIC replay data set.
- Waiting for RC to test STOP timeout impact.
- SW update had issues due to bug in QC, now scheduled for tomorrow.
Async reconstruction
- Remaining oscilation problem: GPUs get sometimes stalled for a long time up to 2 minutes. Checking 2 things:
- does the situation get better without GPU monitoring? --> Inconclusive
- We can use increased GPU processes priority as a mitigation, but doesn't fully fix the issue.
- ḾI100 GPU stuck problem will only be addressed after AMD has fixed the operation with the latest official ROCm stack.
- MI100 tests on async looked OK, though it was only ~15 jobs. All of them went through without a long GPU stall. Though there were short GPU stalls, which we also see in sync. To be investigated.
- Limiting factor for pp workflow is now the TPC time series, which is to slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
AliECS related topics:
- Extra env var field still not multi-line by default., created https://its.cern.ch/jira/browse/OGUI-1624 to follow this up seperately from other tickets.
GPU ROCm / compiler topics:
- List of important issues with AMD:
- Issues that disappeared but not yet understood: random server reboot with alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when DMA engine turned off, MI100 stalling with ROCm 5.5.
- Leaving this here in case at some point it reappears, or we want to try to understand the root cause.
- Random crashes on MI100 due to memory error, can be worked around by serializing all kernel and DMA transfers, which has 20% performance degradation.
- EPN will deploy the fix by AMD with the SW update on Thursday. Then we can remove the automatic workaround. Will only be merged into a later official ROCm release.
- Discussed with AMD the meaning of a couple of workarounds we are applying. Once we are stable, should remove some and see if it is still needed.
- HIP_DEVICE_MALLOC workaround could be removed, not needed any more. NO_SCRATC_RECLAIM workaround also removed, with new ROCm was even degrading perfromance. NO_LOWER_LDS compiler option must remain, helps amdgpu-function-calls and improves performance.
- SLC9 GPU container in Jenkins working and RPM publishing fixed. Set up as secondary container for online builds (since we still need the old container for the infra node) and switched to a unified SLC9 container for offline builds (can run on the GRID and on the EPNs with GPU).
- Created GenTopo (topology generation) build for EL9, to be used by EPN once they move their infra nodes to EL9.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Improved GPU standalone build, so we can build with sanitizers in optimized mode, and with sanitizers using clang. Revealed a couple of minor issues, which were not affecting the processing. Fixed.
- This was used to understand a long-pending issue with random Floating Point Exceptions when running TPC tracking on CPU. The problem is auto-vectorization of the compiler, which uses only 2 of the 4 SSE lanes, leading to computation of random data in the remaining lanes, which can trigger an FPE. (Random means, the compiler even initializes the data to 0, but that eventually leads to computation of sqrt(-inf). It seems there is no way to fix this. As a workaround, by default I'll disable FPE traps in the standalone benchmark when compiler with -ffast-math. (In the code I can only check for -ffast-match, not for -ftree-vectorize, which would be the correct check. But checking for fast-math might be enough, since it makes the problem much more likely.
- GCC bumped to 14.2
- Once we bump arrow (PR opened by Giulio), we can bump LLVM to 19.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of monte carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
- Final solution: merging transformation maps on the fly into a single flat object: Still WIP
- Pending OpenCL2 issues:
- printf not working due to confirmed bug in clang, fix is being prepared. Prevents further debugging for now.
- Crash in merger, which can be worked around by disabling clang SPIRV optimization. Probably bug in clang, but need to fix printf first to debug.
- Also with optimization disabled, crashing later in TPC merging, need printf to debug.
- Felix debugged the OpenCL clusterization problem to be due to off-by-one offset in NoiseSuppression. Need to check how that can happen only in OpenCL.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows.
- Improved code-generation and application of DETERMINISTIC mode flags, such that GPU RTC can enable the deterministic mode and no_fast_math flags in the code it compiles. Now also yields 100% same results as the CPU in deterministic mode.
- However, originally planned to have deterministic mode a runtime flag only if RTC is used. Tunrs out this won't work, since configuration parameters might be rounded when scaled on the host before being passed on to the GPU. I.e. unavoidable to recompile the host code in deterministic mode.
- Switched from thrust library to CUB library for sorting using the full GPU device. Thrust was adding unnecessary synchronizations. I patched them away in CUDA's thrust, but never had time to do the same in HIP thrust, and my CUDA patch didn't work any more with the latest CUDA. So switching to CUB seems the simplest solution. Time per TF reduced from 4.1 to 4.0 seconds on my NVIDIA GPU, unfortunately no improvements on MI50/MI100.
- With RTC able to compile with NO_FAST_MATH, working to fix the issue that some clusters fail track-model decoding if RTC enabled due to floating point rounding.
- We had fixed this without RTC using per-kernel compilation and per-kernel compile flags, but so far this was not possible with RTC.
- Next for RTC: Ability to use different optimization parameters per compile, so we can use different optimized settings for MI50 and MI100, and we should be able to load settings at runtime for RTC, quite helpful for debugging and tuning, particularly for Gabriele.
Other Topics
- Quest position: 2 Interviews yesterday, 4 more interviews today, will decide end of the week on the candidate.
EPN major topics:
- Fast movement of nodes between async / online without EPN expert intervention.
- 2 goals I would like to set for the final solution:
- It should not be needed to stop the SLURM schedulers when moving nodes, there should be no limitation for ongoing runs at P2 and ongoing async jobs.
- We must not lose which nodes are marked as bad while moving.
- Interface to change SHM memory sizes when no run is ongoing. Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
- Lubos to provide interface to querry current EPN SHM settings - ETA July 2023, Status?
- Improve DataDistribution file replay performance, currently cannot do faster than 0.8 Hz, cannot test MI100 EPN in Pb-Pb at nominal rate, and cannot test pp workflow for 100 EPNs in FST since DD injects TFs too slowly. https://alice.its.cern.ch/jira/browse/EPN-244 NO ETA
- DataDistribution distributes data round-robin in absense of backpressure, but it would be better to do it based on buffer utilization, and give more data to MI100 nodes. Now, we are driving the MI50 nodes at 100% capacity with backpressure, and then only backpressured TFs go on MI100 nodes. This increases the memory pressure on the MI50 nodes, which is anyway a critical point. https://alice.its.cern.ch/jira/browse/EPN-397
- TfBuilders should stop in ERROR when they lose connection.
- Allow epn user and grid user to set nice level of processes: https://its.cern.ch/jira/browse/EPN-349
- Updated ODC again, should now work with ALMA 9.5 on the infra nodes.
Other EPN topics: