https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it occurs only at termination, and fixing the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
Support in DPL GUI to send individual START and STOP commands.
The problem I mentioned last time with non-critical QC tasks and the DPL CCDB fetcher is real. It will need some extra work to solve; otherwise non-critical QC tasks will stall the DPL chain when they fail.
Problem with no inputs shown in DebugGUI fixed.
Global calibration topics:
TPC IDC / SAC calibration:
SAC-only workflow fixed.
Found 3 problems in IDC+SAC workflow:
Colliding data specs: fixed
Some outputs were declared with Timeframe lifetime that should be Sporadic: fixed
DPL raw proxy too slow if multiple input channels send at different rates. Fix proposed in a PR.
Need to recheck whether there are further issues with the new software containing all fixes. Will ping Robert for a slot.
Switched the vobox to the 1-NUMA-domain SLURM queue setup. Still running the 1-GPU workflow, but this gives us more memory, so we can run the optimized GPU setup.
Bug in NUMA-aware GPU selection when submitting two 4-GPU jobs to the same node: fixed in O2, with a workaround in place for GRID jobs until we switch to a new O2 tag.
High failure rates tonight. One bad EPN node, which was taken down; the EPN team was informed. Other errors seem CCDB-related.
Some failures in async reco again yesterday. Two prominent failure reasons:
Failures getting CCDB objects
Running out of SHM memory. Since the 1-NUMA setup gives us more memory, we have increased the SHM segment from 20 to 30 GB.
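For reference, a minimal sketch of the size bump expressed in bytes; the commented invocation at the end is an assumption about how the value reaches the FairMQ devices (option name not verified here), not the actual workflow script:

```shell
#!/usr/bin/env bash
# Bytes for the new 30 GB SHM segment (previously 20 GB).
GB=$((1024 * 1024 * 1024))
OLD_SHMSIZE=$((20 * GB))
NEW_SHMSIZE=$((30 * GB))
echo "SHM segment: $OLD_SHMSIZE -> $NEW_SHMSIZE bytes"
# Assumed invocation (hypothetical option name):
#   o2-dpl-workflow ... --shm-segment-size $NEW_SHMSIZE
```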
EPN major topics:
New AMD ROCm >= 5.4 no longer supports CentOS as the operating system. Officially supported are now only RHEL, SLES, and Ubuntu. Checking with AMD whether Alma or Rocky Linux would work. We should switch the EPN farm to a new OS before data taking, otherwise we will not be able to deploy new fixes from AMD.
Update ROCm to 5.3 for now.
Need a procedure / tool to move nodes quickly between the online and async partitions. EPN working on this. Currently most EPNs are usually in online, and we have to ask to get some in async. We should arrive at a state where all EPNs that are not needed online are in async by default.
Opened a JIRA ticket for EPN to follow up on the interface to change SHM memory sizes when no run is ongoing (which was requested 1 year ago). Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
If the DD connection of a node fails, the node should be taken out and count against nmin, otherwise it can give the false impression that the processing on the other nodes is too slow.
Should change the dpl-workflow script to fail if any process in the DPL pipe (workflow | workflow | ...) exits with a non-zero code.
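A minimal sketch of how such a check could look in bash (the `stage_ok` / `stage_fail` functions are hypothetical stand-ins for the real workflow binaries): with `set -o pipefail` the pipe's exit code reflects any failing stage, and bash's `PIPESTATUS` array additionally exposes the per-stage codes.

```shell
#!/usr/bin/env bash
# With pipefail, the exit code of a pipe is the last NON-ZERO stage code,
# not just the exit code of the final stage.
set -o pipefail

# Hypothetical stand-ins for the real DPL workflow binaries:
stage_ok()   { cat; }
stage_fail() { cat > /dev/null; return 1; }

echo "tf" | stage_fail | stage_ok > /dev/null
rc_pipefail=$?                # 1: the middle stage's failure propagates

set +o pipefail
echo "tf" | stage_fail | stage_ok > /dev/null
stages="${PIPESTATUS[*]}"     # per-stage codes: "0 1 0"
rc_default="${stages##* }"    # 0: by default only the last stage counts

echo "pipefail rc=$rc_pipefail, default rc=$rc_default, stages=$stages"
```

Note that `PIPESTATUS` is bash-specific and must be read immediately after the pipe, before any other command overwrites it.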
Switching phase 1 of topology generation to using updateable RPMs instead of a script in the home folder (basically just copying the existing script to another place). Jenkins build present, and repository installed on the EPNs. Next: change the O2DPG scripts, then the command sent by AliECS.
QC / Monitoring / InfoLogger updates:
TPC has opened a first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side; the plan is to extend this to all detectors and to also include trending for raw data sizes.
Improve error messages in the AliECS GUI for EPN-related failures. PDP error messages are sent via ODC in the Run reply, e.g. for topology generation failures, but ECS does not show them, only a generic "EPN Partition Initialize Failed": https://alice.its.cern.ch/jira/browse/OCTRL-734
Send list of FLPs in run to topology generation. https://alice.its.cern.ch/jira/browse/OCTRL-753
Send flag whether it is a production / staging environment to topology generation. https://alice.its.cern.ch/jira/browse/OCTRL-751
GPU ROCm / compiler topics:
Locally tested OpenCL compilation with Clang 14, bumping -cl-std from clc++ (OpenCL 2.0) to CLC++2021 (OpenCL 3.0) and using the Clang-internal SPIR-V backend. The Arrow bump to 8.0, which was a prerequisite, is done.
Work on bumping GCC still ongoing (by Giulio); will follow up with Clang 15 afterwards, once we are at Arrow 10.
Still a problem in DD after the GCC bump.
Problem with ROCm 5.1; will test whether it disappears with ROCm 5.3, otherwise we need to check in detail.
Found a new HIP internal compiler error when compiling without optimization: -O0 makes the compilation fail with an unsupported LLVM intrinsic. Reported to AMD.
Found a new miscompilation with -ffast-math enabled in looper following; disabled -ffast-math for now.
Must create new minimal reproducer for compile error when we enable LOG(...) functionality in the HIP code. Check whether this is a bug in our code or in ROCm. Lubos will work on this.
Ruben found another compiler problem with template treatment. We have a workaround for now; need to create a minimal reproducer and file a bug report.
TPC GPU Processing:
Random GPU crashes under investigation.
Problem in the refit of low-pT tracks in the TrackParCov model was due to large cluster errors and high covariance for very low-pT tracks. Ruben made some improvements to stabilize the fit.
Work on TPC track assignment finished. Should now be more precise, and TFwd/TBwd always correctly constrained.
Improvements for missing pad rows need a dead-pad map; we cannot generally loosen the tolerances.
Now working on distortion corrections.
ITS GPU Tracking and Vertexing:
Michael is still working on the Elias delta encoding and needs to fix the algorithm to respect C++ pointer alignment constraints. Afterwards he will continue with the integration.
Issues currently lacking manpower, waiting for a volunteer:
For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
Redo / improve the parameter range scan for tuning GPU parameters. In particular on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
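As a sketch, such a scan could pick the best configuration on a training time frame and then confirm it on a separate test time frame. The benchmark function below is a synthetic stand-in for the real GPU benchmark, and the parameter names and ranges are made up for illustration:

```shell
#!/usr/bin/env bash
# Synthetic stand-in for the real GPU benchmark: returns a fake runtime
# for a given input time frame and (threads, blocks) configuration.
run_benchmark() {
  local tf=$1 threads=$2 blocks=$3
  echo $(( 1024 / threads + 120 / blocks + 10 ))
}

best=999999; best_cfg=""
for threads in 256 512 1024; do   # made-up parameter ranges
  for blocks in 30 60 120; do
    t=$(run_benchmark tf_train $threads $blocks)
    if [ "$t" -lt "$best" ]; then best=$t; best_cfg="$threads/$blocks"; fi
  done
done

# Validate the winner on a SEPARATE test time frame of the correct size,
# so the tuning is not overfitted to the training data.
t_test=$(run_benchmark tf_test "${best_cfg%/*}" "${best_cfg#*/}")
echo "best cfg $best_cfg: train=$best test=$t_test"
```

The point of the last step is exactly the train/test separation mentioned above: the configuration is selected on one time frame and its performance quoted on another.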