Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
Event Display Commissioning
- The supplier has informed us that, although the ED PC was previously confirmed as in stock, they cannot ship it before January. Since we have a temporary setup now, we will only replace the PC next year. In the meantime, Guy is also checking alternatives in case they still cannot deliver in January.
Problems during operation
- TPC GPU processing crashing regularly since we updated to ROCm 4.3.
- I have a reproducer with data replay from recorded raw data (a 50 TB dataset). So far I could not identify a single TF in the dataset that causes the issue.
- The same dataset also crashes with ROCm 4.1, though with a different ROCm error message. It is not clear whether it is the same issue --> downgrading to ROCm 4.1 makes no sense.
- I have found a workaround, which costs a factor of 2-3 in performance but avoids the crash. This should be sufficient for the pilot beam.
- Issue is data driven, not reproducible with Pb-Pb MC, or with other data we recorded before.
- Investigation is currently stalled due to an EOS problem: data replay from EOS constantly gets stuck, and I cannot store the 50 TB locally. The EPN folks and Latchezar are investigating.
- Seeing backpressure messages from the tpc-its-matcher in global runs, even though the processing speed is sufficient. Over time the SHM segment fills up and all runs fail after ~10 minutes; it is not clear whether this is related.
- Ruben has a reproducer that can run locally.
- If the its-tpc-matcher is removed, we instead see some backpressure messages from the its-raw-decoder to the readout-proxy (not there before), but the SHM problem is gone.
- To be investigated with Giulio.
- RPMs of nightly builds are not available on the EPNs: we have nightly builds of the O2PDPSuite for the EPNs now, but the RPMs are missing. This should be fixed ASAP so that we can update the software more easily and flexibly.
Issues on EPN farm affecting PDP:
- AMD GPUs are currently not working on CS8; we are investigating. For the moment the EPNs must stay on CC8.
Issues currently lacking manpower, waiting for a volunteer:
- Tool to "own" the SHM segment and keep it allocated and registered for the GPU. Tested that this works by running a dummy workflow in the same SHM segment in parallel. Need to implement a proper tool, add the GPU registration, and add the FairMQ feature to reset the segment without recreating it. Proposed to Matthias as a project. Remarks: it might be necessary to clear the Linux cache before allocation. What do we do with the DD-owned unmanaged SHM region? Related issue: what happens if we start multiple async chains in parallel --> must also guarantee good NUMA pinning.
- Contacted the GSI group to ask whether they can implement this; no reply yet.
- For debugging, it would be convenient to have a proper tool that (using FairMQ's debug mode) can list all messages currently in the SHM segments, similar to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
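The idea behind the SHM-owning tool above can be sketched as follows. This is only an illustration using Python's multiprocessing.shared_memory as a stand-in for the FairMQ shared-memory segment; the segment name, size, and the GPU-registration step are assumptions, not the actual O2/FairMQ interfaces.

```python
# Sketch: a process that "owns" a shared-memory segment so it stays
# allocated (and GPU-registered) across workflow restarts.
from multiprocessing import shared_memory

SEGMENT_NAME = "o2_shm_demo"    # hypothetical segment name
SEGMENT_SIZE = 4 * 1024 * 1024  # 4 MiB for the demo; EPNs would use far more

def create_segment(name=SEGMENT_NAME, size=SEGMENT_SIZE):
    """Allocate the segment and touch every page, so the memory is
    physically backed before any workflow attaches."""
    seg = shared_memory.SharedMemory(name=name, create=True, size=size)
    seg.buf[:] = b"\x00" * size  # pre-fault all pages
    # A real tool would now register the buffer with the GPU runtime
    # (pinned-memory registration) -- omitted here.
    return seg

def reset_segment(seg):
    """Wipe the contents without freeing/recreating the segment,
    mimicking the proposed FairMQ 'reset without recreate' feature."""
    seg.buf[:] = b"\x00" * len(seg.buf)

if __name__ == "__main__":
    owner = create_segment()
    try:
        # Another process can attach by name while the owner holds it:
        client = shared_memory.SharedMemory(name=SEGMENT_NAME)
        client.buf[0] = 42           # a workflow writes ...
        assert owner.buf[0] == 42    # ... the owner sees the same memory
        client.close()
        reset_segment(owner)         # new run: clear, but keep the allocation
        assert owner.buf[0] == 0
    finally:
        owner.close()
        owner.unlink()               # only the owner ever frees the segment
```

The key point is that the segment's lifetime is bound to the owner process, not to the workflows that attach and detach between runs.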
Workflow repository
- Waiting for AliECS to implement the new fields in the GUI; a demo version of the GUI has already been implemented by Vasco.
- New repository already in use, automatic merging of QC workflows implemented, next step is calibration workflows.
- Need additional DPL feature to automatically connect reconstruction and calibration workflows: https://alice.its.cern.ch/jira/browse/O2-2611
Changes for October 1st:
- Login as the epn user will be disabled for detector experts; login should be via NICE credentials. EPN still needs to fix the log file mode issue so that experts can read the logs: https://alice.its.cern.ch/jira/browse/O2-2555
- Automatic loading of DD/QC/O2 will be disabled for workflows; only workflows that load their modules explicitly will keep working. https://alice.its.cern.ch/jira/browse/O2-2553
- With the next O2 update, workflows will depend on ODC. Users should load O2PDPSuite to have all required dependencies.
EPN DPL Metric monitoring:
- The excessive metric rate has been fixed; now in operation: https://alice.its.cern.ch/jira/browse/O2-2583
Excessive error messages to InfoLogger:
- Reduction of the InfoLogger messages is ongoing.
Missing error messages / information from ODC / DDS in InfoLogger:
- ODC does not forward errors written to stderr to the InfoLogger, so we do not see when a process segfaults, dies by an exception, or runs out of memory without checking the log files on the node. There is only a cryptic error that a process exited, without specifying which one. https://alice.its.cern.ch/jira/browse/O2-2604
- PartitionID and RunNr are shown only for messages coming from DPL, not for those coming from ODC/DDS: https://alice.its.cern.ch/jira/browse/O2-2602
- The run is not stopped when processes die unexpectedly. It should be, at least optionally during commissioning: https://alice.its.cern.ch/jira/browse/O2-2554
- ODC/FMQ is flooding our logs with bogus errors: https://github.com/FairRootGroup/ODC/issues/19. This is fixed in the next O2 version; we need to update.
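The stderr forwarding requested above amounts to something like the following sketch: a supervisor mirrors each child's stderr line-by-line into the central log, tagged with the process name, and names the process and exit code when it dies. This is a Python illustration of the mechanism, not ODC code; the process name and command are made up.

```python
# Sketch: launch a child process, forward each stderr line to a log
# tagged with the process name, and report a non-zero exit explicitly,
# so crashes are visible without opening log files on the node.
import subprocess
import sys

def run_and_forward(name, cmd, log=sys.stdout):
    """Run cmd, mirroring its stderr with a [name] tag, and log
    which process exited with which code."""
    proc = subprocess.Popen(cmd, stderr=subprocess.PIPE, text=True)
    for line in proc.stderr:
        log.write(f"[{name}] stderr: {line}")
    ret = proc.wait()
    if ret != 0:
        log.write(f"[{name}] exited with code {ret}\n")  # name the culprit
    return ret

if __name__ == "__main__":
    # Simulated device that dies with an error message on stderr:
    run_and_forward("demo-device",
                    [sys.executable, "-c",
                     "import sys; sys.exit('device died unexpectedly')"])
```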
Memory monitoring:
- We are missing proper monitoring of the free memory in the SHM on the EPNs. Created a JIRA: https://alice.its.cern.ch/jira/browse/R3C-638
- When something fails in a run, e.g. a GPU getting stuck (problem reported above), this yields unclear secondary problems: processes die because they run out of memory, since too many time frames get in flight. Hence we need to limit the number of time frames in flight. https://alice.its.cern.ch/jira/browse/O2-2589
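A minimal sketch of the kind of SHM free-memory metric requested in R3C-638, assuming the shared-memory segments are backed by the tmpfs at /dev/shm (the usual case for FairMQ's shmem transport); the path and reporting format are illustrative.

```python
# Sketch: report total/used/free bytes of the tmpfs backing /dev/shm,
# where FairMQ shared-memory segments typically live.
import os

def shm_usage(path="/dev/shm"):
    """Return (total, used, free) bytes of the filesystem at path."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    return total, total - free, free

if __name__ == "__main__":
    total, used, free = shm_usage()
    print(f"/dev/shm: total={total >> 20} MiB "
          f"used={used >> 20} MiB free={free >> 20} MiB")
```

A monitoring agent would sample this periodically and push the values as a metric, so that SHM exhaustion is visible before processes start dying.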
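The in-flight limitation can be sketched with a counting semaphore: injection of new time frames blocks (i.e. exerts backpressure) once the cap is reached, instead of letting memory use grow without bound. The TF objects, worker, and limit below are placeholders, not O2/DPL code.

```python
# Sketch: cap the number of time frames (TFs) processed concurrently, so a
# stuck consumer causes backpressure instead of unbounded memory growth.
import queue
import threading

MAX_TF_IN_FLIGHT = 4          # illustrative limit

slots = threading.Semaphore(MAX_TF_IN_FLIGHT)
done = queue.Queue()

def process_tf(tf_id):
    # ... reconstruction would happen here ...
    done.put(tf_id)
    slots.release()           # free the slot only when the TF is finished

def inject_tf(tf_id):
    slots.acquire()           # blocks once MAX_TF_IN_FLIGHT TFs are in flight
    threading.Thread(target=process_tf, args=(tf_id,)).start()

if __name__ == "__main__":
    for tf in range(16):
        inject_tf(tf)         # injection stalls instead of exhausting memory
    results = sorted(done.get() for _ in range(16))
    assert results == list(range(16))
```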
EOS Cleanup:
- Currently there are 20 PB of data on the EOS disk buffer; none of the data currently being written, mostly raw data, is in the file catalogue. When we switch to EPN2EOS for the transfer, all data will go to the catalogue. Then we have to disable the old scripts and run a cleanup campaign. We should give the detectors a phase of ~2 months to mark which data is relevant, and then wipe all the rest.