Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
High priority framework topics:
- Problem with EndOfStream / Dropping Lifetime::Timeframe data
- 3rd debug feature in development (verify that lifetime::timeframe messages were created): Implemented and merged.
- Change consumeWhenAll completion policy to wait for oldestPossibleTimeframe: PR has been merged.
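For reference, a minimal sketch of how a workflow selects such a completion policy via DPL's customize hook (the device-name regex is a placeholder and the exact CompletionPolicyHelpers overload is an assumption; the oldestPossibleTimeframe wait itself is internal to the merged PR):
```cpp
#include "Framework/CompletionPolicyHelpers.h"
#include "Framework/DeviceSpec.h"
#include <regex>

void customize(std::vector<o2::framework::CompletionPolicy>& policies)
{
  using namespace o2::framework;
  // Request consumeWhenAll for all devices whose name matches the regex;
  // with the merged PR this policy additionally waits until
  // oldestPossibleTimeframe guarantees no further inputs can arrive.
  policies.push_back(CompletionPolicyHelpers::consumeWhenAll(
    "consume-all-for-calib", [](DeviceSpec const& spec) {
      return std::regex_match(spec.name, std::regex("my-calib-device.*"));
    }));
}

#include "Framework/runDataProcessing.h"
```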
- If consumeWhenAny, do not check for lifetime::timeframe input / output agreement. Implemented but not merged.
- Fix parsing of spec strings, to allow specifying lifetime without specifying subspec. Apparently not a limitation in DPL but in QC? Status?
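For context, a hedged illustration of the spec-string syntax in question (binding and data origin below are placeholders): with an explicit subspec a lifetime suffix can already be given, and the fix should allow omitting the subspec while keeping the suffix:
```cpp
#include "Framework/DataDescriptorQueryBuilder.h"

// Works today: subspec (0) given explicitly, lifetime as query parameter.
auto withSubspec = o2::framework::select("tracks:TPC/TRACKS/0?lifetime=timeframe");
// What the fix should allow: no subspec, but still a lifetime suffix.
auto withoutSubspec = o2::framework::select("tracks:TPC/TRACKS?lifetime=condition");
```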
- Problem with QC topologies with expendable tasks - Fixed in DPL, waiting for feedback.
- New issue: sometimes the CCDB populator produces backpressure without processing data. Has already crashed several Pb-Pb runs: https://alice.its.cern.ch/jira/browse/O2-4244
- Disappeared after disabling the CPV gain calib, which was very slow. However, this can only have hidden the problem. Apparently there is a race condition that can trigger a problem in the input handling, which leaves the CCDB populator stuck. Since the run function of the CCDB populator is not called, and it has no special completion policy but simply consumeWhenAny, this is likely a generic problem.
- Cannot be debugged during Pb-Pb right now, since it is mitigated. But it must be understood afterwards.
- Issue appeared with TFs that have no data. Must suppress calling the run function in some cases. Led to problems with some workflows at P2. Status?
- Problem with Calib CCDB object not arriving at some devices (while the same object arrives at other devices, so it was for sure fetched). Happens at P2 and in async reco, and sometimes also in FST. Not reproducible with a single TF, but when running for a certain time.
- C++20 / ROOT 6.30 status?
- Implement new EndOfStream scheme for calibration.
Other framework tickets:
- TOF problem with receiving condition in tof-compressor: https://alice.its.cern.ch/jira/browse/O2-3681
- Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to have the pure payload: low priority.
- Backpressure reporting when there is only 1 input channel: no progress: https://alice.its.cern.ch/jira/browse/O2-4237
- Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
- https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but it has side effects which must also be fixed.
- https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
- https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
- https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
- https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
- https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
- Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it is only at the termination, and the fix of the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
- Support in DPL GUI to send individual START and STOP commands.
- Problem I mentioned last time with non-critical QC tasks and DPL CCDB fetcher is real. Will need some extra work to solve it. Otherwise non-critical QC tasks will stall the DPL chain when they fail.
- DPL sending SHM metrics for all devices, not only input proxy: https://alice.its.cern.ch/jira/browse/O2-4234
- Some improvements to ease debugging: https://alice.its.cern.ch/jira/browse/O2-4196 https://alice.its.cern.ch/jira/browse/O2-4195 https://alice.its.cern.ch/jira/browse/O2-4166
- After Pb-Pb, we need to do a cleanup session and go through all these pending DPL tickets with a higher priority, and finally try to clean up the backlog.
Global calibration topics:
- TPC IDC and SAC workflow issues to be reevaluated with new O2 at restart of data taking. Cannot reproduce the problems any more.
Sync processing:
- Created a JIRA ticket summarizing my proposal for a script that parses and summarizes InfoLogger messages: https://alice.its.cern.ch/jira/browse/R3C-992
- Software update postponed due to problem with calling / not calling run() function.
Async reconstruction:
- Remaining oscillation problem: GPUs sometimes get stalled for a long time, up to 2 minutes.
- Checking 2 things: does the situation get better without GPU monitoring? --> Inconclusive
- We can use increased GPU process priority as a mitigation, but it doesn't fully fix the issue.
- MI100 GPU stuck problem will only be addressed after AMD has fixed the operation with the latest official ROCm stack.
EPN major topics:
- Fast movement of nodes between async / online without EPN expert intervention.
- 2 goals I would like to set for the final solution:
- It should not be needed to stop the SLURM schedulers when moving nodes, there should be no limitation for ongoing runs at P2 and ongoing async jobs.
- We must not lose which nodes are marked as bad while moving.
- Interface to change SHM memory sizes when no run is ongoing. Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
- Lubos to provide interface to query current EPN SHM settings - ETA July 2023, Status?
- Improve DataDistribution file replay performance; currently it cannot go faster than 0.8 Hz, so we cannot test the MI100 EPN in Pb-Pb at nominal rate, and cannot test the pp workflow for 100 EPNs in the FST since DD injects TFs too slowly. https://alice.its.cern.ch/jira/browse/EPN-244 NO ETA
- DataDistribution distributes data round-robin in absence of backpressure, but it would be better to do it based on buffer utilization and give more data to the MI100 nodes. Currently we drive the MI50 nodes at 100% capacity with backpressure, and only backpressured TFs go to the MI100 nodes. This increases the memory pressure on the MI50 nodes, which is a critical point anyway. https://alice.its.cern.ch/jira/browse/EPN-397
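To make the proposed alternative concrete, a minimal standalone sketch (not DataDistribution code; the node bookkeeping and the utilization metric are assumptions): instead of round-robin, each TF goes to the node with the lowest buffer utilization, so the faster MI100 nodes, which drain their buffers quicker, automatically receive more TFs:
```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-node view; the real DD bookkeeping differs.
struct NodeState {
  std::size_t bufferTotal; // SHM buffer size of the node
  std::size_t bufferUsed;  // currently occupied bytes
};

// Pick the node with the lowest buffer utilization. MI100 nodes drain
// their buffers faster and thus get more TFs, instead of receiving
// only the TFs that back up once the MI50 nodes are saturated.
std::size_t pickTargetNode(const std::vector<NodeState>& nodes)
{
  std::size_t best = 0;
  double bestUtil = 2.0; // above any real utilization in [0, 1]
  for (std::size_t i = 0; i < nodes.size(); ++i) {
    double util = static_cast<double>(nodes[i].bufferUsed) /
                  static_cast<double>(nodes[i].bufferTotal);
    if (util < bestUtil) {
      bestUtil = util;
      best = i;
    }
  }
  return best;
}
```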
- TfBuilders should stop in ERROR when they lose connection.
- Had a problem bumping to CMake 3.28, required a fix in DD. Done by Lubos.
Other EPN topics:
Raw decoding checks:
- Add additional check on DPL level, to make sure firstOrbit received from all detectors is identical, when creating the TimeFrame first orbit.
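A minimal sketch of such a check (function name and input layout are hypothetical; the actual implementation would compare the first orbit extracted from each detector's raw data):
```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

// Verify that all detectors report the same firstOrbit before adopting
// it as the TimeFrame first orbit.
uint32_t checkedTfFirstOrbit(const std::vector<uint32_t>& firstOrbits)
{
  std::optional<uint32_t> ref;
  for (auto orbit : firstOrbits) {
    if (!ref) {
      ref = orbit;
    } else if (*ref != orbit) {
      throw std::runtime_error("firstOrbit mismatch between detectors");
    }
  }
  if (!ref) {
    throw std::runtime_error("no detector provided a firstOrbit");
  }
  return *ref;
}
```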
Full system test issues:
Topology generation:
- Should test to deploy topology with DPL driver, to have the remote GUI available. Status?
QC / Monitoring / InfoLogger updates:
- TPC has opened the first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side; the plan is to extend this to all detectors and to also include trending for raw data sizes.
AliECS related topics:
- Extra env var field still not multi-line by default.
High priority RC YETS issues:
- Fix dropping lifetime::timeframe for good
- Giulio has implemented all the requested features.
- This triggered a regression for some workflows for TFs without data.
- Need to fix the software, do runs at P2 with the latest version, and then check what errors we see.
- There is an independent problem with CCDB objects getting lost by DPL leading to "Dropping lifetime::timeframe"
- Expendable tasks in QC. Everything merged. Needs to be tested.
- Stabilize calibration / fix EoS: We have a plan for how to implement it. Will take some time, but hopefully done before the restart of data taking.
- Fix problem with ccdb-populator: no idea yet, no ETA.
GPU ROCm / compiler topics:
- Found new HIP internal compiler error when compiling without optimization: -O0 makes the compilation fail with an unsupported LLVM intrinsic. Reported to AMD.
- Found a new miscompilation with -ffast-math enabled in looper following; for now disabled -ffast-math.
- Must create new minimal reproducer for compile error when we enable LOG(...) functionality in the HIP code. Check whether this is a bug in our code or in ROCm. Lubos will work on this.
- Found another compiler problem with template treatment found by Ruben. Have a workaround for now. Need to create a minimal reproducer and file a bug report.
- Debugging the calibration, debug output triggered another internal compiler error in HIP compiler. No problem for now since it happened only with temporary debug code. But should still report it to AMD to fix it.
- Tested new ROCm 6.0.
- Crashes with a new failure on MI100 GPUs. Doesn't crash every time, but reproducible within a few minutes. Also crashes sometimes on MI50, but more rarely. Data-driven problem, does not crash with all data sets. Provided a reproducer to AMD.
- Since ROCm 6 seems to work at least partially on MI50, retried all the pending ROCm issues with the new version on MI50. Some have been fixed, some are still pending. Providing an update with the current status of all pending issues for AMD. To be discussed in a meeting with EPN and AMD.
- Plan is to:
- Make official ROCm release fully stable on MI50 and MI100.
- Solve all types of GPU stalls.
- Solve the other pending issues.
TPC GPU Processing:
- Bug in TPC QC with MC embedding: TPC QC does not respect the sourceID of the MC labels, so it confuses tracks of signal and of background events.
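For reference, the MC label already carries the needed information: o2::MCCompLabel identifies a track by event, track and source ID, so the comparison must include the source as well (the helper below is a hypothetical sketch, not the QC code):
```cpp
#include "SimulationDataFormat/MCCompLabel.h"

// Two labels refer to the same MC track only if event, track AND source
// agree; comparing only event/track mixes signal tracks with embedded
// background tracks that happen to share the same IDs.
bool sameMCTrack(const o2::MCCompLabel& a, const o2::MCCompLabel& b)
{
  return a.getSourceID() == b.getSourceID() &&
         a.getEventID() == b.getEventID() &&
         a.getTrackID() == b.getTrackID();
}
```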
- New problem with bogus values in TPC fast transformation map still pending. Sergey is investigating, but waiting for input from Alex.
- TPC has provided the dead channel map as a calib object. Treatment of the dead channel map in the track-following missed-row cut is implemented. Jens and I are checking the performance.
- Marian and Jens requested a cut to exclude regions from tracking. So far 3 options (a sketch of the first option follows this list):
- Exclude rows < n per sector: 36 bytes of parameters, can fit in GPU constant memory. Will probably implement this anyway at first.
- Use bitmap as for the dead channel map, few kb.
- More generic approach as recommended by Marian. We should make sure not to use large splines with many nodes; e.g. the transformation map is up to 600 MB!
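A sketch of the first option (names are assumptions): one 8-bit minimum-row threshold per TPC sector gives exactly 36 bytes, small enough for GPU constant memory:
```cpp
#include <cstdint>

constexpr int kNSectors = 36;

// One minimum-row threshold per sector: 36 bytes in total, so the cut
// can live in GPU constant memory next to the other tracking parameters.
struct TrackingRowCut {
  uint8_t minRow[kNSectors];
};

// Applied per cluster during track following (illustrative).
inline bool acceptRow(const TrackingRowCut& cut, int sector, int row)
{
  return row >= cut.minRow[sector];
}
```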
- FPE problem was fixed. It was a generic problem where tracks were not correctly rejected, and half-filled track objects were passed to the output. This also caused some tracks to have bogus sinPhi.
- Problem with tracks having invalid sinPhi in OuterParam.
- Problem is that the track can have a bogus sinPhi at the moment the OuterParam is stored.
- Temporarily solved by constraining sinPhi (see the sketch below).
- Would like to change the behavior in general:
- Tracks should never get into a state with bogus sinPhi, i.e. prevent that in propagation.
- There are some cases where this can happen, e.g. when rotating / mirroring looping tracks, before transporting them to other pad rows, they can have a bogus state, and if e.g. all future updates fail, this could be stored as OuterParam.
- I'd like to change the looper handling to not merge the track segments at all, but keep them separate, fit them individually, and only extrapolate in between them. Could be done optionally, only in case the full fit has problems. Let's see.
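The temporary constraint mentioned above amounts to clamping sinPhi to a valid range before the parameters are stored (a sketch; the actual limit used in the tracking code is an assumption here):
```cpp
#include <algorithm>

// Clamp sinPhi into a physically meaningful range before storing
// OuterParam, so a transiently bogus propagation state cannot end up
// in the output. The margin below 1.0 is an assumed example value.
inline float constrainedSinPhi(float sinPhi)
{
  constexpr float kMaxSinPhi = 0.999f;
  return std::clamp(sinPhi, -kMaxSinPhi, kMaxSinPhi);
}
```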
- Problem spotted by Gabriele in the multi-threaded CPU decoding of track-model clusters: affected 0 to a few (order of 10) clusters in 5-10% of the TFs, so negligible. Turned out to be a race condition bug and was fixed.
- Gabriele reported another problem in the GPU encoding of unattached clusters. Currently checking. A bit concerning, since it would affect last year's data (only unattached clusters, though).
General GPU Processing:
- Started work to make O2 propagator easily usable in ITS tracking, which is not part of the GPU reconstruction library.
- More complicated than it might seem:
- Propagator uses GPU constant cache.
- The constant cache is a static symbol per compilation unit if device-relocatable code is not used.
- We cannot use device-relocatable code at the moment; this time it fails in CUDA, not in HIP. Underlying problems are:
- Code gets slower (often only marginally, similarly to -fPIC on CPU).
- Functions would be compiled once, and then used in different GPU kernels. Different GPU kernels have different register constraints, and compiling the function with default constraints makes them incompatible with some kernels.
- Could be worked around by putting explicit register constraints for the functions, but:
- Leads to performance degradation.
- Makes the code more complicated.
- Constraints would need to depend on actual target architecture, making it even more complicated.
- Without device-relocatable code, we have multiple instances of the constant memory cache symbol, and not all of them are updated when the constants are filled (sketch below).
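To illustrate this last point with a plain CUDA sketch (not the O2 code): without device-relocatable code, every compilation unit that includes the header owns its own copy of the __constant__ symbol, so the upload has to reach each copy:
```cpp
#include <cuda_runtime.h>

struct ProcessingConstants {
  float bz; // example parameter
};

// Without -rdc=true this symbol is duplicated: every .cu file that
// includes this header gets a private instance of gConstants.
__constant__ ProcessingConstants gConstants;

// Consequently the upload must be executed once per compilation unit
// (e.g. via a per-unit registration hook); otherwise kernels in other
// units read stale constants.
void uploadConstants(const ProcessingConstants& c)
{
  cudaMemcpyToSymbol(gConstants, &c, sizeof(c));
}
```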
- The work plan is now to (this goes beyond the ITS problem):
- Switch CUDA kernel definitions from C++ to CMake, to allow compiling all kernels as individual compilation units (parallelizes and thus speeds up compilation): DONE
- Make this optionally work with device-relocatable code (the register constraint problem disappears, since functions are compiled multiple times, once per kernel): DONE
- Turned out not to work with device-relocatable code for technical reasons, not followed up for now.
- If device-relocatable-code is not used, automatically obtain and update all constant cache symbols: DONE
- Provide an (optionally device-relocatable-code) object that can be linked to other GPU code, e.g. ITS, which provides all code needed to use the propagator. The same mechanism as for the other kernel files will obtain and fill the constant cache: WIP
- Use constant memory in fewer places, to disentangle the code. In particular, pass the processing context as a kernel argument, not in the constant cache (see the sketch at the end of this list).
- Once this is all working in CUDA, port over all the work to the HIP backend, including RTC.
- Switch the HIP backend to autogenerate the HIP code from the CUDA code.
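As a sketch of the constant-memory disentangling mentioned in the work plan (CUDA, all names assumed): passing the context by value as a kernel argument removes the need to keep a constant symbol in sync across compilation units:
```cpp
// Small context struct passed by value; no __constant__ symbol needed,
// so nothing must be synchronized across compilation units.
struct KernelContext {
  float bz;    // magnetic field, example parameter
  int nTracks; // number of tracks to process
};

__global__ void fitKernel(KernelContext ctx)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= ctx.nTracks) {
    return;
  }
  // ... fit track i using ctx.bz instead of reading a constant cache ...
}
```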