Alice Weekly Meeting: Software for Hardware Accelerators / PDP Run Coordination / Full System Test

Start: 2021-10-13T11:00:00+02:00
End: 2021-10-13T12:20:00+02:00

Wednesday 13 Oct 2021, 11:00 → 12:20 Europe/Zurich

Videoconference

ALICE GPU Meeting

Zoom Meeting ID: 61230224927
Host: David Rohr

- 11:00 → 11:20
  PDP Run Coordination 20m
  
  Minutes
  
  Speakers: David Rohr (CERN), Ole Schmidt (CERN)
  Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
  
  Event Display Commissioning
  
  ED PC delivered and working, will bring it to the ARC today.
  
  Problems during operation
  
  TPC GPU processing crashing regularly since we updated to ROCm 4.3.
  
  No news.
  
  Investigation is currently stalled due to an EOS problem: data replay from EOS constantly gets stuck, and I cannot store the 50 TB locally. EPN folks and Latchezar are investigating.
  
  Issue is data driven, not reproducible with Pb-Pb MC, or with other data we recorded before.
  
  I have found a workaround, which costs a factor 2-3 in performance, but avoids the crash. Should be sufficient for the pilot beam.
  
  The same dataset crashes also in ROCm 4.1, however with a different ROCm error message. Not clear if it is the same issue or not --> Downgrade to ROCm 4.1 makes no sense.
  
  I have a reproducer with data replay from recorded raw data. 50 TB dataset. So far I could not identify a single TF in the dataset that causes the issue.
  
  Seeing backpressure messages from tpc-its-matcher in global runs, although the processing speed is sufficient. Over time, the SHM segment runs full and all runs fail after ~10 minutes - not clear whether this is related.
  
  Ruben has a reproducer that can run locally.
  
  If the its-tpc-matcher is removed, we see some backpressure messages from its-raw-decoder to readout-proxy instead (not there before), but the SHM problem is gone.
  
  To be investigated with Giulio.
  
  The underlying problem is probably also responsible for backpressure observed from QC tasks
  
  Giulio suspects a FairMQ feature which will be turned off now to check this hypothesis (first on the EPNs and then on the FLPs on Monday)
  
  Topic will be followed up offline
  
  RPMs of nightly builds not available on EPNs.
  
  Fixed, RPMs are available, nightly build was installed on Monday / Tuesday.
  
  Problem with Wednesday's (today's) nightly build: contains both O2 and O2-dataflow
  
  Problem fixed by Timo. Waiting for a PR which should fix the ED and after that Giulio will redo the build manually (this gives us some margin before the TED shots on Friday)
  
  Reducing overhead from headers:
  
  Matthias is working on this: https://alice.its.cern.ch/jira/browse/O2-2395
  
  A PR will be prepared; the aim is to merge it at the end of October, after the pilot beam.
  
  Issues on EPN farm affecting PDP:
  
  AMD GPUs currently not working on CS8, investigating, for the moment the EPN must stay at CC8.
  
  Issues currently lacking manpower, waiting for a volunteer:
  
  Tool to "own" the SHM segment and keep it allocated and registered for the GPU. Tested that this works by running a dummy workflow in the same SHM segment in parallel. Need to implement a proper tool, add the GPU registering, and add the FMQ feature to reset the segment without recreating it. Proposed to Matthias as a project. Remarks: Might be necessary to clear the Linux cache before allocation. What do we do with the DD-owned unmanaged SHM region? Related issue: what happens if we start multiple async chains in parallel --> Must also guarantee good NUMA pinning.
  
  Contacted GSI group whether they can implement this, no reply yet.
  
  For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
  
  Workflow repository
  
  Waiting for AliECS to implement new fields in the GUI, demo version of GUI already implemented by Vasco. Teo has a few more high priority tasks before he can finalize the GUI.
  
  Need additional DPL feature to automatically connect reconstruction and calibration workflows: https://alice.its.cern.ch/jira/browse/O2-2611.
  
  Changes for October 1st:
  
  Was delayed until Monday 4th due to TED shots.
  
  Login as epn user disabled for detector experts, no complaints so far.
  
  Automatic loading of latest O2 version switched off, can now have per workflow O2 version.
  
  EPN workflows now depend on ODC, O2PDPSuite loads ODC automatically, users can simply load O2PDPSuite/[version]
  
  Excessive O2 error messages to InfoLogger:
  
  Reducing info logger messages ongoing.
  
  Asked all detectors to use "--infologger-severity warning" or higher for their workflows. We should monitor this and enforce it.
  
  Ole will check the InfoLogger regularly and ping the responsible persons if their devices have the wrong logger severity set
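
  The regular check described above could be partly automated. The following is a minimal sketch, assuming the workflow command lines are available as plain strings; only the `--infologger-severity` flag comes from these minutes. The severity ordering, the acceptance of the `--flag=value` form, and the helper names are assumptions made for illustration.

```python
import shlex

# Assumed ordering of severities, lowest to highest (not taken from the minutes).
SEVERITY_ORDER = ["debug", "info", "warning", "error", "fatal"]


def infologger_severity(command):
    """Return the value passed to --infologger-severity, or None if absent."""
    tokens = shlex.split(command)
    for i, token in enumerate(tokens):
        if token == "--infologger-severity" and i + 1 < len(tokens):
            return tokens[i + 1].lower()
        if token.startswith("--infologger-severity="):
            return token.split("=", 1)[1].lower()
    return None


def is_compliant(command, minimum="warning"):
    """True if the command sets a severity of at least `minimum`."""
    severity = infologger_severity(command)
    if severity not in SEVERITY_ORDER:
        return False
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(minimum)


def non_compliant(commands):
    """Return the commands that need a follow-up with the responsible person."""
    return [c for c in commands if not is_compliant(c)]
```

  Calling `non_compliant()` on a list of command lines returns those that either omit the flag or set it below "warning"; the command names used in any such list would be placeholders, not real workflows.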
  
  Freezing of FLP software:
  
  FLP will freeze the software for the pilot beam with the next FLP suite to be installed on Monday. Perhaps we should ensure that O2 dev remains compatible with that FLP suite until the pilot beam (i.e. for 1 month). That would allow us to still update O2 to dev on the EPNs. It would basically mean we should not use new features in O2, and not bump FairMQ.
  
  Giulio will make sure that no new PRs rely on updated dependencies. This could also be hard-coded in the defaults files in alidist.
  
  Missing error messages / information from ODC / DDS in InfoLogger:
  
  ODC does not forward errors written to stderr to the InfoLogger, thus we do not see when a process segfaults / dies by exception / runs OOM without checking the log files on the node. There is only the cryptic error that a process exited, without specifying which one. https://alice.its.cern.ch/jira/browse/O2-2604. No progress. I have called for a meeting with Mohammad to discuss how to proceed there. From the DDS team, there is currently no effort to implement this as it is deemed not needed.
  
  PartitionID and RunNr are shown only for messages coming from DPL, not for those coming from ODC/DDS: https://alice.its.cern.ch/jira/browse/O2-2602. Work in progress, needs new InfoLogger version which Sylvain will provide next week. Rest is ready.
  
  Run is not stopped when processes die unexpectedly. This should be the case, at least optionally during commissioning: https://alice.its.cern.ch/jira/browse/O2-2554. Will get FATAL messages in that case with next ODC version, but stopping of the run not yet implemented.
  
  ODC/FMQ is flooding our logs with bogus errors: https://github.com/FairRootGroup/ODC/issues/19. Fixed.
  
  Memory monitoring:
  
  We are missing a proper monitoring of the free memory in the SHM on the EPNs. Created a JIRA here: https://alice.its.cern.ch/jira/browse/R3C-638
  
  When something fails in a run, e.g. GPU getting stuck (problem reported above), this yields unclear secondary problems with processes dying because they run out of memory. This happens because too many time frames are in flight. Hence we need a limit on the number of time frames in flight. https://alice.its.cern.ch/jira/browse/O2-2589
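
  As a coarse stand-in until proper monitoring exists, the free space of the shared-memory filesystem can be polled. This is only a sketch under assumptions: it takes `/dev/shm` as the mount point and reports filesystem-level usage, which is not the same as the free memory inside a managed SHM segment. The function names and the threshold are hypothetical.

```python
import shutil


def shm_usage(path="/dev/shm"):
    """Return (free_bytes, free_fraction) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return usage.free, 0.0
    return usage.free, usage.free / usage.total


def shm_alarm(path="/dev/shm", min_free_fraction=0.1):
    """True if the free fraction dropped below the (hypothetical) threshold."""
    _, free_fraction = shm_usage(path)
    return free_fraction < min_free_fraction
```

  A periodic job could call `shm_alarm()` and raise a warning when it returns True; whether this proxy tracks the actual segment occupancy would have to be verified on an EPN.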
  
  EOS Cleanup:
  
  Currently 27 PB of data on EOS disk buffer. None of the data currently written is in the file catalogue; it is mostly raw data. When we switch to EPN2EOS for the transfer, all data will go to the catalogue. Then we have to disable the old scripts, and we have to do a cleanup campaign. We should give detectors a phase of ~2 months to mark what data is relevant, and then wipe all the rest.
- 11:20 → 11:40
  Full System Test 20m
  
  Minutes
  
  Speakers: David Rohr (CERN), Ole Schmidt (CERN)
  Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
  
  Full system test status
  
  Full system test with 2 NUMA domains not terminating correctly. DD currently forwards EOS only to one consumer: https://alice.its.cern.ch/jira/browse/O2-2375. This is also a problem for some workflows on the EPN. We should think of a general solution.
  
  Discussed possible solutions. Needs a new FairMQ feature, https://github.com/FairRootGroup/FairMQ/issues/384 (jira added) - Discussed with Dennis again yesterday, took a bit longer than expected, still expected to be ready this week.
  
  AMD / GPU stability issues in FST:
  
  Compiler fails with internal error when optimization is disabled (-O0): Shall be fixed in ROCm 4.4. Waiting to receive a preliminary patch to confirm it is fixed.
  
  Reported new internal compiler error when we enable log messages in GPU management code. Shall be fixed in ROCm 4.4. Waiting to receive a preliminary patch to confirm it is fixed.
  
  AMD manually installed a ROCm 4.4 beta on one of our servers yesterday since the RPMs were not working. Will test with that ASAP.
  
  Random crashes processing cosmics data with both ROCm 4.1 and 4.3, data driven, not seen before. Could be the same ~8 hour MTF crash we had before, which was never understood and disappeared at some point.
  
  GPU Performance issues in FST
  
  One infrequent performance issue remains, single iterations on AMD GPUs can take significantly longer, have seen up to 26 seconds instead of 18 seconds. Under investigation, but no large global effect.
  
  Performance of the 32 GB GPU with 128-orbit TFs is lower than with the 70-orbit TFs we tested in August. Results fluctuate a bit, but the average is between 1600 and 1650 GPUs (compared to 1475 GPUs for 70-orbit TFs). Matteo has implemented a first version of the benchmark, it is currently running on the EPN.
  
  New 6% performance regression with ROCm 4.3.
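
  For orientation, the GPU counts quoted above translate into a relative increase as follows. This is a small arithmetic sketch, assuming the figures are the number of GPUs needed in each configuration; the variable names are illustrative.

```python
# GPU counts quoted in the minutes.
gpus_70_orbit = 1475
gpus_128_orbit = (1600, 1650)

# Relative increase in required GPUs going from 70-orbit to 128-orbit TFs.
increase = [n / gpus_70_orbit - 1 for n in gpus_128_orbit]
increase_percent = [round(100 * x, 1) for x in increase]  # [8.5, 11.9]
```

  Under that assumption, the 128-orbit configuration needs roughly 8.5% to 11.9% more GPUs than the 70-orbit one.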
  
  Important general open points:
  
  Multi-threaded GPU pipeline in DPL: https://alice.its.cern.ch/jira/browse/O2-1967 : Giulio will follow this up, but currently some other things still have higher priority.
  
  https://alice.its.cern.ch/jira/browse/QC-569 : Processors should not be serialized when they access the same input data. (reported by Jens for QC, but relevant for FST). "Early forward approach merged, checked by Jens, speeds up the processing as expected".
  
  Chain getting stuck when SHM buffer runs full: on hold. Long discussion last week but no conclusion yet. All discussed solutions require knowledge of how many TFs are in flight globally on a processing node, which is not yet available in DPL. https://alice.its.cern.ch/jira/browse/O2-2589.
  
  Giulio has a PR for such a throttling mechanism, currently only for analysis.
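
  The idea of limiting the number of TFs in flight can be illustrated with a counting semaphore. This is a minimal sketch and not the mechanism from the PR mentioned above: it assumes a single place in one process that sees every TF enter and leave the chain, which is exactly the global knowledge the minutes say DPL does not yet have. The class and method names are hypothetical.

```python
import threading


class TimeFrameThrottle:
    """Caps the number of time frames (TFs) in flight on one node."""

    def __init__(self, max_in_flight):
        # One semaphore slot per TF allowed in flight at the same time.
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_admit(self):
        """Claim a slot for a new TF; return False instead of blocking."""
        return self._slots.acquire(blocking=False)

    def done(self):
        """Release the slot once the TF has left the processing chain."""
        self._slots.release()
```

  The input side would call `try_admit()` before injecting a TF and hold it back (or drop it) when the call returns False, while the last processing step calls `done()`. This bounds the memory held by TFs in flight at the cost of extra latency at the input.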
  
  Minor open points:
  
  https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but has side effects which must also be fixed.
  
  https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
  
  https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
  
  https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
  
  https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
  
  https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
  
  DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
  
  Detector status:
  
  EMCAL errors (no crash, just messages, EMCAL is working on it).
- 11:40 → 12:00
  Software for Hardware Accelerators 20m
  
  Minutes
  
  Speakers: David Rohr (CERN), Ole Schmidt (CERN)
  Report on GPU memory micro benchmark progress
  
  Matteo will present some results today.
  
  ITS GPU Tracking and Vertexing:
  
  Matteo will continue work on the tracking after the memory micro benchmarks are implemented.
  
  TRD Tracking
  
  Working on strict matching mode (filter out ambiguous matches to obtain very clean track sample)
  
  David and Ole need to sit together to commission the TPC-TRD tracking on GPUs (as soon as possible...)
  
  GPU monitoring
  
  Nothing happened yet, Johannes will follow this up with Alexander. But Alexander agreed to implement Matteo's code in one way or another.
  
  ANS Encoding (Michael's presentation)
  
  Michael presented the current status. Will now start implementing the new renorming in C++.
- 12:00 → 12:20
  
  GPU Microbenchmark status 20m
  
  Speaker: Matteo Concas (INFN Torino (IT))
