Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
Event Display Commissioning
- ED PC delivered and working; will bring it to the ARC today.
Problems during operation
- TPC GPU processing crashing regularly since we updated to ROCm 4.3.
- No news.
- Investigation currently stalled due to an EOS problem: data replay from EOS constantly gets stuck, and I cannot store the 50 TB locally. EPN folks and Latchezar are investigating.
- Issue is data driven, not reproducible with Pb-Pb MC, or with other data we recorded before.
- I have found a workaround, which costs a factor of 2-3 in performance but avoids the crash. Should be sufficient for the pilot beam.
- The same dataset crashes also in ROCm 4.1, however with a different ROCm error message. Not clear if it is the same issue or not --> Downgrade to ROCm 4.1 makes no sense.
- I have a reproducer with data replay from recorded raw data. 50 TB dataset. So far I could not identify a single TF in the dataset that causes the issue.
- Seeing backpressure messages from tpc-its-matcher in global runs, although the processing speed is sufficient. Over time, the SHM segment fills up and all runs fail after ~10 minutes; not clear whether this is related.
- Ruben has a reproducer that can run locally.
- If the its-tpc-matcher is removed, we see some backpressure messages from its-raw-decoder to readout-proxy instead (not there before), but the SHM problem is gone.
- To be investigated with Giulio.
- The underlying problem is probably also responsible for backpressure observed from QC tasks
- Giulio suspects a FairMQ feature which will be turned off now to check this hypothesis (first on the EPNs and then on the FLPs on Monday)
- Topic will be followed up offline
- RPMs of nightly builds not available on EPNs.
- Fixed, RPMs are available, nightly build was installed on Monday / Tuesday.
- Problem with Wednesday's (today's) nightly build: contains both O2 and O2-dataflow
- Problem fixed by Timo. Waiting for a PR which should fix the ED and after that Giulio will redo the build manually (this gives us some margin before the TED shots on Friday)
Reducing overhead from headers:
- Matthias is working on this: https://alice.its.cern.ch/jira/browse/O2-2395
- a PR will be prepared, aim to merge it at the end of October after the pilot beam
Issues on EPN farm affecting PDP:
- AMD GPUs currently not working on CS8, investigating, for the moment the EPN must stay at CC8.
Issues currently lacking manpower, waiting for a volunteer:
- Tool to "own" the SHM segment and keep it allocated and registered for the GPU. Tested that this works by running a dummy workflow in the same SHM segment in parallel. Need to implement a proper tool, add the GPU registering, add the FMQ feature to reset the segment without recreating it. Proposed to Matthias as a project. Remarks: Might be necessary to clear the Linux cache before allocation. What do we do with the DD-owned unmanaged SHM region? Related issue: what happens if we start multiple async chains in parallel --> Must also guarantee good NUMA pinning.
- Contacted GSI group whether they can implement this, no reply yet.
- For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
Workflow repository
- Waiting for AliECS to implement new fields in the GUI; demo version of GUI already implemented by Vasco. Teo has a few more high-priority tasks before he can finalize the GUI.
- Need additional DPL feature to automatically connect reconstruction and calibration workflows: https://alice.its.cern.ch/jira/browse/O2-2611.
Changes for October 1st:
- Was delayed until Monday 4th due to TED shots.
- Login as epn user disabled for detector experts, no complaints so far.
- Automatic loading of latest O2 version switched off, can now have per workflow O2 version.
- EPN workflows now depend on ODC; O2PDPSuite loads ODC automatically, so users can simply load O2PDPSuite/[version]
Excessive O2 error messages to InfoLogger:
- Reduction of InfoLogger messages is ongoing.
- Asked all detectors to use "--infologger-severity warning" or higher for their workflows. We should monitor this and enforce it.
- Ole will check the InfoLogger regularly and ping the responsible persons if their devices have the wrong logger severity set
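The severity check described above could be partly automated. The following is a minimal sketch, not an existing tool: it assumes the workflow command lines are available as plain strings, and both the severity ordering and the `o2-example-workflow` name are illustrative assumptions (only the `--infologger-severity warning` option itself is taken from the notes).

```python
import shlex

# Assumed ordering, lowest to highest; the real set of accepted values may differ.
SEVERITY_ORDER = ["debug", "info", "warning", "error", "fatal"]


def infologger_severity(command):
    """Return the value passed to --infologger-severity, or None if absent."""
    tokens = shlex.split(command)
    for i, token in enumerate(tokens):
        if token == "--infologger-severity" and i + 1 < len(tokens):
            return tokens[i + 1]
        if token.startswith("--infologger-severity="):
            return token.split("=", 1)[1]
    return None


def is_compliant(command, minimum="warning"):
    """True if the command sets a severity of at least `minimum`."""
    severity = infologger_severity(command)
    if severity not in SEVERITY_ORDER:
        return False  # missing or unknown value: flag for follow-up
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(minimum)


# Hypothetical workflow command lines, for illustration only.
commands = {
    "det-a": "o2-example-workflow --infologger-severity warning",
    "det-b": "o2-example-workflow --infologger-severity info",
    "det-c": "o2-example-workflow",
}
offenders = sorted(name for name, cmd in commands.items() if not is_compliant(cmd))
print(offenders)
```

Treating a missing option as non-compliant is a design choice here: it errs on the side of pinging the responsible person rather than assuming a safe default.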
Freezing of FLP software:
- FLP will freeze the software for the pilot beam with the next FLP suite to be installed on Monday. Perhaps we should ensure that O2 dev remains compatible with that FLP suite until the pilot beam (i.e. for 1 month). That would allow us to still update O2 to dev on the EPNs. It would basically mean we should not use new features in O2, and not bump FairMQ.
- Giulio will keep an eye that no new PRs rely on updated dependencies. Could also be hard-coded in the defaults files in alidist
Missing error messages / information from ODC / DDS in InfoLogger:
- ODC does not forward errors written to stderr to the InfoLogger, thus we do not see when a process segfaults / dies by exception / runs OOM without checking the log files on the node. There is only the cryptic error that a process exited, without specifying which one. https://alice.its.cern.ch/jira/browse/O2-2604. No progress. I have called for a meeting with Mohammad to discuss how to proceed there. From the DDS team, there is currently no effort to implement this as it is deemed not needed.
- PartitionID and RunNr are shown only for messages coming from DPL, not for those coming from ODC/DDS: https://alice.its.cern.ch/jira/browse/O2-2602. Work in progress, needs new InfoLogger version which Sylvain will provide next week. Rest is ready.
- Run is not stopped when processes die unexpectedly. It should be, at least optionally during commissioning: https://alice.its.cern.ch/jira/browse/O2-2554. Will get FATAL messages in that case with next ODC version, but stopping of the run not yet implemented.
- ODC/FMQ is flooding our logs with bogus errors: https://github.com/FairRootGroup/ODC/issues/19. Fixed.
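To illustrate the kind of information that is missing today (which process died, how, and what it last wrote to stderr), here is a minimal sketch of a wrapper that captures this for a child process. It is not how ODC or DDS launch processes; the `example-device` name and the failing child command are hypothetical and only serve to make the example runnable.

```python
import subprocess
import sys


def run_and_report(name, argv):
    """Run a child process and describe how it ended, including its stderr."""
    result = subprocess.run(argv, capture_output=True, text=True)
    code = result.returncode
    if code == 0:
        return f"{name}: exited normally"
    # On POSIX, a negative return code means the child was killed by a signal.
    cause = f"killed by signal {-code}" if code < 0 else f"exit code {code}"
    lines = result.stderr.strip().splitlines()
    detail = lines[-1] if lines else "no stderr output"
    return f"{name}: {cause}; last stderr line: {detail}"


# Hypothetical failing device: raises an exception, which Python reports on stderr.
report = run_and_report(
    "example-device",
    [sys.executable, "-c", "raise RuntimeError('out of memory')"],
)
print(report)
```

A message of this shape, forwarded to the InfoLogger, would already distinguish a segfault (signal) from an exception or a non-zero exit without logging into the node.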
Memory monitoring:
- We are missing a proper monitoring of the free memory in the SHM on the EPNs. Created a JIRA here: https://alice.its.cern.ch/jira/browse/R3C-638
- When something fails in a run, e.g. GPU getting stuck (problem reported above), this yields unclear secondary problems with processes dying because they run out of memory. This happens because too many time frames are in flight. Hence we need a limit on the number of time frames in flight. https://alice.its.cern.ch/jira/browse/O2-2589
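As a stopgap until proper monitoring exists, free space on the shared-memory filesystem can be polled from a script. The sketch below assumes the SHM segments live on a tmpfs mounted at `/dev/shm`, and the 20% threshold is an arbitrary illustrative value. It only sees filesystem-level usage, not the free space inside a managed segment, so it would not replace the monitoring requested in the JIRA ticket.

```python
import shutil


def shm_free_fraction(path="/dev/shm"):
    """Return the fraction of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total if usage.total else 0.0


def check_shm(path="/dev/shm", min_free=0.2):
    """Return a one-line status; WARNING when free space falls below `min_free`."""
    free = shm_free_fraction(path)
    level = "WARNING" if free < min_free else "OK"
    return f"{level}: {free:.0%} of {path} free"


# "." is used only so the example runs anywhere; use /dev/shm on a Linux node.
print(check_shm("."))
```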
EOS Cleanup:
- Currently 27 PB of data on the EOS disk buffer; none of the data currently written is in the file catalogue, and it is mostly raw data. When we switch to EPN2EOS for the transfer, all data will go to the catalogue. Then we have to disable the old scripts, and we have to do a cleanup campaign. We should give detectors a phase of ~2 months to mark what data is relevant, and then wipe all the rest.