Color code: important news during the meeting: green; news from this week: blue; news from last week: purple; no news: black
Event Display Commissioning
Problems during operation
- TPC GPU processing crashing regularly since we updated to ROCm 4.3.
- This killed every global run / most TPC standalone runs so far.
- Not clear why it didn't occur in some TPC standalone runs, seems partially data driven.
- Overnight I was able to reproduce it in a standalone data replay run with ROCm 4.3; I will now retest the same setup with ROCm 4.1 to see whether it is a regression in ROCm or something else.
- The test will run for 24-48 hours. If the setup is stable with ROCm 4.1, a regression in 4.3 is very likely, and we would probably need to downgrade to 4.1, since AMD will likely need O(weeks) to fix it.
- How do we proceed? Downgrade ROCm again? That would require:
- EPN needs to go back from CentOS 8.4 to 8.3, since ROCm 4.1 is not compatible with 8.4 (while 4.3 is not compatible with 8.3).
- EPN would need to deploy the kernel module patch for 4.1 manually on all nodes (otherwise we are back to the situation where the nodes die randomly at start/stop, which we had before, which was fixed by 4.3).
- The old build container with 4.1 cannot build the current O2 software, but I can build a special build container, compatible with the current O2 build but with ROCm 4.1. That special container would then need to be used temporarily for the EPN builds.
- Investigated the problem a bit with GDB during a global run:
- There is an error message from the kernel module in dmesg.
- Then processing on one GPU stops.
- The application hangs indefinitely in a hipDeviceSynchronize call, the GPU doesn't respond any more to the application.
- Possible alternative to run the TPC on the CPU:
- The cluster finder is optimized for Pb-Pb and for the GPU. The implementation on the CPU is rather slow, and it gets extremely slow for sparse data. It is also unmaintained code after Felix left the group in Frankfurt.
- Did some simple benchmarking: the majority of the time is spent in clearing / filling the charge maps. This runs at ~10 GB/s, which is OK but not great.
- At the current speed, we would need ~220 EPNs for TPC CPU processing of sparse data.
- This cannot be avoided without changing the way the clusterizer works. (No problem on the GPU, since it has ~1 TB/s throughput and can better overlap with processing.)
- I see 2 ways for improvements:
- Use vector instructions, which might require compiling the code with a proper -march flag.
- I can tune the OpenMP scheme to process multiple sectors in parallel; that could yield a speedup of probably ~2x to 4x.
- Johannes will be on vacation starting tomorrow and will be back at the end of next week. He would be needed for a downgrade of the EPN servers; he might be able to look into it next week.
Issues on EPN farm affecting PDP:
- AMD GPUs are currently not working on CS8; investigating. For the moment the EPN must stay on CC8.
Issues currently lacking manpower, waiting for a volunteer:
- Tool to "own" the SHM segment and keep it allocated and registered for the GPU. Tested that this works by running a dummy workflow in the same SHM segment in parallel. Still needed: implement a proper tool, add the GPU registering, add the FMQ feature to reset the segment without recreating it. Proposed to Matthias as a project.
- Remarks: it might be necessary to clear the Linux cache before allocation. What do we do with the DD-owned unmanaged SHM region? Related issue: what happens if we start multiple async chains in parallel --> must also guarantee good NUMA pinning.
- Discussed with Volker; this should ideally be done by the GSI group, since it mostly involves FMQ/DDS/ODC, which are all developed at GSI. Will contact Mohammad for that.
- For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
Workflow repository
- Waiting for AliECS to implement new fields in the GUI, demo version of GUI already implemented by Vasco.
- New ODC deployed, O2 version will be made selectable 1.10.
- New repository already in use, automatic merging of QC workflows implemented, next step is calibration workflows.
EPN DPL Metric monitoring:
- The metric data rate was too high; a fix is already deployed but not yet tested, due to the problems with the TPC in global runs: https://alice.its.cern.ch/jira/browse/O2-2583
Excessive error messages to InfoLogger:
- Detectors are sending excessive error messages for corrupt raw data. We should reduce this in a way that we still see when there are errors, but we must not flood the log. This was observed yesterday from ITS and TOF.
Missing errors messages / information from ODC / DDS in InfoLogger:
- ODC does not forward errors written to stderr to the InfoLogger, thus we do not see when a process segfaults / dies by exception / runs out of memory without checking the log files on the node. There is only the cryptic error that a process exited, without specifying which one. https://alice.its.cern.ch/jira/browse/O2-2604
- PartitionID and RunNr are shown only for messages coming from DPL, not for those coming from ODC/DDS: https://alice.its.cern.ch/jira/browse/O2-2602.
- Run is not stopped when processes die unexpectedly. This should be the case, at least optionally during commissioning: https://alice.its.cern.ch/jira/browse/O2-2554
- ODC/FMQ is flooding our logs with bogus errors: https://github.com/FairRootGroup/ODC/issues/19. Fix available, to be deployed in next days.
Memory monitoring:
- We are missing a proper monitoring of the free memory in the SHM on the EPNs. Created a JIRA here: https://alice.its.cern.ch/jira/browse/R3C-638
- When something fails in a run, e.g. a GPU getting stuck (problem reported above), this yields unclear secondary problems with processes dying because they run out of memory. This happens because too many time frames get in flight, hence we need a limit on the number of time frames in flight. This is already a JIRA in the FST section; just repeating it here since it affects data taking.