Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Michael Rohr
    • 11:00 11:20
      PDP System Run Coordination 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      EOS handling / START-STOP-START:

      • Tests with START / STOP / START planned with the FLP team today at 2pm (goal is to identify detector workflow related issues and communicate them to the detector teams)

      Problem at EOR and with calib workflows:

      • https://alice.its.cern.ch/jira/browse/O2-3315

      Full system test:

      • Multi-threaded pipeline still not working in FST / sync processing; it works only in the standalone benchmark.
        • Should be highest priority in DPL development, once slow topology generation, EoS timeout, START-STOP-START have been fixed.
        • Possible workaround: Could implement a special topology with 2 devices in a chain (same number of pipeline levels, so we have basically 2 devices per GPU). The first would extract all relative offsets of TPC data from the SHM buffers and send them via an asynchronous non-DPL channel to the actual processing device. In that way, the processing could start timeframe n+1 before returning the result for timeframe n to DPL. When DPL starts the doProcessing function, the GPU would already be processing that TF, and would just return it when ready. Development time should be no more than 3 exclusive days.
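      The workaround described above is essentially double buffering across two chained devices. A toy sketch of the data flow (Python threads and queues standing in for DPL devices and the non-DPL side channel; all names and the dummy "processing" are hypothetical, not the O2/DPL API):

```python
import queue
import threading

def two_stage_pipeline(timeframes):
    """Toy model: a front device extracts offsets and forwards them over
    a side channel, so the processing stage can start TF n+1 before the
    result for TF n has been handed back."""
    side = queue.Queue()      # stand-in for the asynchronous non-DPL channel
    results = queue.Queue()

    def extractor():          # device 1: extract relative offsets (cheap)
        for tf in timeframes:
            side.put(tf)
        side.put(None)        # end-of-stream marker

    def processor():          # device 2: the actual ("GPU") processing
        while True:
            tf = side.get()
            if tf is None:
                break
            results.put(tf * 2)   # stand-in for the real reconstruction

    t1 = threading.Thread(target=extractor)
    t2 = threading.Thread(target=processor)
    t1.start(); t2.start()
    t1.join(); t2.join()

    out = []
    while not results.empty():
        out.append(results.get())
    return out
```

      Because the side channel is filled independently of when results are fetched, the processing stage never waits for the consumer before picking up the next timeframe.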

      Global calibration status:

      • Problem with TPC SCD calib performance confirmed by Ole.
        • Performance problem seems to be from ROOT compression.
        • First attempt will be to switch off compression and test online; then we have to find a way to speed it up.
      • TPC IDC calibration:
        • Tested successfully yesterday, now cleaning up obsolete variants (routing data via EPNs, and splitting over 2 calib nodes).
        • Problem with TPC SAC workflow: not receiving data on the calib node from the FLPs - no progress.
      • TPC time gain calib: implemented downscaling, tested in technical run that the workflow comes up, but could not test that it works without tracks.
      • FIT calib integrated into global workflow.
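      The compression trade-off behind the SCD calib mitigation (switch compression off first, optimize later) can be illustrated with zlib as a stand-in; ROOT's own compression settings are not shown in these notes:

```python
import zlib

# Stand-in payload for a calibration object (highly compressible pattern).
payload = bytes(range(256)) * 1000

stored = zlib.compress(payload, level=0)  # "compression off": store only, cheap
packed = zlib.compress(payload, level=6)  # default compression: smaller, slower

# Both round-trip losslessly; level 0 trades output size for CPU time.
assert zlib.decompress(stored) == payload
assert zlib.decompress(packed) == payload
```

      Disabling compression removes the CPU cost entirely at the price of larger objects, which is why it is only the first mitigation step.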

      • Failures reading CCDB objects at high rate.

      Issues currently lacking manpower, waiting for a volunteer:

      • For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
      • Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
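      The train/test separation asked for in the parameter scan could look like this toy sketch (the "measure" benchmark and all names are hypothetical):

```python
def scan_parameters(param_grid, measure, train_tfs, test_tfs):
    """Pick the fastest parameter set on training time frames, then report
    it on held-out test time frames, so the tuning result is not fitted to
    the same frames it is evaluated on."""
    best, best_time = None, float("inf")
    for params in param_grid:
        total = sum(measure(params, tf) for tf in train_tfs)
        if total < best_time:
            best, best_time = params, total
    test_time = sum(measure(best, tf) for tf in test_tfs)
    return best, test_time

# Toy benchmark: runtime is minimized at params == 3, plus a per-TF cost.
def toy_measure(params, tf):
    return (params - 3) ** 2 + tf
```

      For the AMD GPUs the same idea applies with real benchmark runs and time frames of the correct (representative) size.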

      Speeding up startup of workflow on the EPN:

      • Next step: Measure startup times in online partition controlled by AliECS, and compare to standalone measurement of 18 seconds for process startup.

      EPN Scheduler

      • Problem with no jobs running due to JDL matching problem after resubmission was a vobox bug, fixed by Max.
      • Need a procedure / tool to move nodes quickly between online / async partition. EPN working on this. Currently most EPNs are still usually in online, and we have to ask to get some in async. Should arrive at a state where all EPNs that are not needed are in async by default.
      • Discussed with Latchezar / Chiara that we will run the EPNs in preemptive mode, i.e. no draining, we just kill the grid jobs. With an agent runtime of 24 hours, we'd need to drain 24h in advance, which will on average leave the node idle for 12h. On the EPN grid, async reco job time is O(1h), so we could usually run 12 additional jobs for each one that will fail due to being killed.
        • We'll do the same for nodes we need for software tests.
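      The draining-vs-preemption arithmetic above, spelled out (numbers taken from the notes):

```python
AGENT_RUNTIME_H = 24   # grid agent runtime from the notes
ASYNC_JOB_H = 1        # typical async reco job duration, O(1 h)

# Draining: stop scheduling 24 h before the node is needed; the last agent
# ends at a roughly uniform point in that window, so the node sits idle
# for half the agent runtime on average.
avg_idle_h = AGENT_RUNTIME_H / 2          # -> 12 h

# Preemptive mode: keep scheduling and kill whatever is running when the
# node is needed; the former idle window instead fits ~12 short jobs, at
# the cost of a single job lost to the kill.
extra_jobs = avg_idle_h // ASYNC_JOB_H    # -> 12
```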

      Important framework features to follow up:

      • Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to have the pure payload: low priority.
      • Backpressure reporting when there is only 1 input channel: no progress.
      • Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
      • Found performance problem in o2-dpl-raw-proxy in TPC IDC workflow: could process only 5 kHz of messages. Took a perf trace; the vast majority of the time is spent in FMQ. Opened a JIRA ticket for Alexey: https://alice.its.cern.ch/jira/browse/O2-3355
        • For TPC IDC workflow, now implemented a workaround with message coalescing, which is working. But raw-proxy issue should still be fixed.
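      The message-coalescing workaround boils down to batching many small messages into fewer large sends, so the per-message proxy overhead is paid once per batch. A toy sketch (not the actual TPC IDC code; names are illustrative):

```python
def coalesce(messages, max_batch_bytes):
    """Pack small messages into batches of at most max_batch_bytes each,
    flushing a batch whenever the next message would overflow it."""
    batches, current, size = [], [], 0
    for msg in messages:
        if current and size + len(msg) > max_batch_bytes:
            batches.append(b"".join(current))
            current, size = [], 0
        current.append(msg)
        size += len(msg)
    if current:
        batches.append(b"".join(current))
    return batches
```

      With 100 messages of 10 bytes and a 100-byte batch limit, the proxy sees 10 sends instead of 100, while the payload stream is unchanged.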

      Minor open points:

      • https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but it has side effects which must also be fixed.
      • https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
      • https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
      • https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
      • https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
      • https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
      • DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing way raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
      • Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it is only at the termination, and the fix of the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
      • Support in DPL GUI to send individual START and STOP commands.
      • Assigned JIRA tickets that were left for Matthias to Giulio (sorry :))

      Non-critical QC tasks:

      • QC tasks were backpressuring when failing, since dropping data on the route to QC tasks was not yet enabled. Changed with a software update. To be seen whether it works.
      • Problem I mentioned last time with non-critical QC tasks and DPL CCDB fetcher is real. Will need some extra work to solve it. Otherwise non-critical QC tasks will stall the DPL chain when they fail.

      Reconstruction performance:

      • Online with pp:
        • 3 MHz: With the current data size, without DLBZS and without clusterizer tuning, there is no way we can reasonably reconstruct this data. (It should work with extensive tuning, but in my opinion that would be a waste of time).
        • 2 MHz: Processing was too slow, and despite backpressure we still hit >50% CPU load on the EPNs, so effectively only the hyperthreaded cores were left, i.e. not many resources. Muon reco was disabled, and the next attempt would be to disable QC in addition. It might be possible to run a part of the QC, but we could not test further due to the beam dump.
        • 500 kHz with fewer EPNs: We managed to get it running on 100 EPNs, with ~45% CPU load, and ~20 GB (out of 512) of memory left. Note that the 500 kHz config has the extra reconstruction steps (MUON, DEDX) enabled.  I think this is as far as we can go, and it is already on the edge, but was running for 1h without issues, so it should be stable.
          • Until we have DLBZS and clusterizer tuning, I don't see the possibility to go below 100 EPNs.
      • Online with Pb-Pb:
        • RC started to do SYNTHETIC Pb-Pb performance tests again. Found some TPC QC tasks that were too slow, implemented downscaling.
      • Async performance:
        • Fixes for the AOD performance issue and TPC QC downscaling were implemented; have to retest in the 1 NUMA domain setup.
        • Should also test using 50 kHz Pb-Pb MC data.
      • Opened a JIRA ticket for EPN to follow up the interface to change SHM memory sizes when no run is ongoing (which was requested 1 year ago). Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250

      Topology generation:

      • Added feature to exclude detectors from QC and Calib, as requested by Run Coordination.
      • If detectors had bad QC JSONs in consul, they would be downloaded and merged, and the failure was only reported when o2-qc was invoked, and it was not even clear which detector caused the problem.
        • Added JSON syntax check for each file after downloading (by design doesn't help if the syntax is correct but the content is not).
        • Now also checking the merged QC JSON.
        • Added additional debug output to make it clearer where topology generation failed.
      • DPL does not return -1 correctly if o2-qc fails during topology generation. Now added a workaround in the topology generation, checking stdout / stderr for error messages. https://alice.its.cern.ch/jira/browse/O2-3349
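      The per-file JSON syntax check added above could be sketched like this (hypothetical helper, not the real topology-generation script; detector names are made up):

```python
import json

def validate_qc_jsons(files):
    """Syntax-check each downloaded JSON separately, so a broken file is
    attributed to the detector it came from instead of surfacing only
    when the merged configuration is consumed by o2-qc.
    Returns a dict of {name: error message} for files that fail to parse;
    by design this cannot catch files that are valid JSON but wrong."""
    errors = {}
    for name, text in files.items():
        try:
            json.loads(text)
        except json.JSONDecodeError as exc:
            errors[name] = str(exc)
    return errors
```

      The same check applied once more to the merged configuration catches errors introduced by the merge itself.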

      O2 versions at P2:

      • O2 update on the EPNs on Monday done by Ole, went smoothly.

      Other SRC items:

       

      QC / Monitoring updates:

      • TPC has opened the first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side, and the plan is to extend this to all detectors and to also include trending for raw data sizes.
      • QC / InfoLogger have fixed 3 of the 4 issues wrt log files that we reported in https://alice.its.cern.ch/jira/browse/QC-882. The 4th issue could not be reproduced. New versions deployed on EPN. Will reenable QC log files and retry.

      TPC CTF Skimming:

      • Discussed approach with Ruben. Work ongoing.
    • 11:20 11:40
      Software for Hardware Accelerators 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      General:

      • AMD identified the problem on the MI210. On the MI210 we have to disable the workaround they had asked us to put in place for the MI50, which increases the resource consumption.
      • Locally tested OpenCL compilation with Clang 14, bumping --cl-std from clc++ (OpenCL 2.0) to clc++2021 (OpenCL 3.0) and using the clang-internal SPIR-V backend. Arrow bump to 8.0 done, which was a prerequisite.

      ROCm compilation issues:

      • Need to create a new minimal reproducer for the compile error that appears when we enable the LOG(...) functionality in the HIP code, and check whether this is a bug in our code or in ROCm. Lubos will work on this.
      • Another compiler problem with template treatment found by Ruben. Have a workaround for now. Need to create a minimal reproducer and file a bug report.

      ITS GPU Tracking and Vertexing:

      • Matteo will spend 1 week working on multi-threading of ITS vertexing, then go back to GPU ITS tracking.

      TPC GPU Processing

      • Found a bug in the CPU version of the GPU TPC ZS v4 (DLBZS) decoder and reported it to Felix, who is still checking.
      • More investigation of random GPU crashes:
        • Besides the crashes from broken GPUs, and from corrupt TPC raw data, there is definitely another type of crash, that affects all GPUs, and is not triggered by corrupt raw data.
        • Quite rare, but it happens more often when the node is under heavy load (e.g. it happens more often in a pp run with 100 EPNs than in the same run with 200 EPNs).
        • Happens both with raw and with MC pp data, but seems to happen more often in real data than in MC.
        • Have never seen it happening in Pb-Pb MC data despite the largest statistics. Either this is a coincidence, or it can only happen when certain patterns / occupancies are present in the data.
        • Identified one time frame which crashed 6 times in a few million runs, on 6 different EPNs which were otherwise stable.
        • Need to do some special runs with extensive debug output to investigate this further.
      • TPC CTF Skimming:
        • Implemented in the TPC entropy encoder, fully working but not yet the final version. Can already be used; future improvements will remove somewhat more clusters.
          • Not yet applying eta-check on unattached clusters, but storing all compatible drift-times.
          • Need to take into account TPC distortions for z / eta check. Either with some margin, or assuming some average distortion corrections.
      • While implementing the CTF skimming, found a problem with the decoding of some TPC track-model clusters: the track gets completely odd parameters, tgl > 100, and produces clusters everywhere in the time frame.
        • First assumed this was a side effect of the rounding problem reported some weeks ago, but the rounding cannot have such large effects.
        • Even more strange, I cannot reproduce this behavior with MC data yet (tried encoding MC CTF with CPU, NVIDIA and AMD GPU, always OK).
        • Need to process some raw TFs and encode them to CTFs with EPN MI50, to try to reproduce it.
        • Hopefully there is no more (invisible) corruption, where track parameters do not get completely off.
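      The skimming criterion for unattached clusters described above (keep a cluster if its drift time is compatible with some track window) can be sketched as a toy filter; the real logic lives in the TPC entropy encoder and must additionally allow a margin for distortions:

```python
def skim_unattached(cluster_times, keep_windows):
    """Keep an unattached cluster only if its drift time falls inside at
    least one window of compatible drift times (e.g. derived from track
    eta ranges). Purely illustrative: real windows would be widened by
    a margin for the TPC space-charge distortions."""
    return [t for t in cluster_times
            if any(lo <= t <= hi for lo, hi in keep_windows)]
```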

      TRD Tracking

       

      ANS Encoding

      • Waiting for PR with AVX-accelerated ANS encoding
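      For reference, the core rANS transform that the AVX PR accelerates fits in a few lines; this is a textbook toy (no renormalization, arbitrary-precision state), not the O2 implementation:

```python
def rans_encode(msg, freq, cum, M):
    """Toy rANS: push symbols into one integer state (LIFO).
    freq[s] is the symbol frequency, cum[s] its cumulative start,
    and M = sum of all frequencies (a power of two in practice)."""
    x = 0
    for s in msg:
        x = (x // freq[s]) * M + cum[s] + (x % freq[s])
    return x

def rans_decode(x, n, freq, cum, M):
    """Pop n symbols back out; they emerge in reverse encode order."""
    out = []
    for _ in range(n):
        slot = x % M
        s = next(k for k in freq if cum[k] <= slot < cum[k] + freq[k])
        x = freq[s] * (x // M) + slot - cum[s]
        out.append(s)
    return out[::-1]

# Example model over three symbols; frequencies sum to M = 4.
freq = {"a": 2, "b": 1, "c": 1}
cum = {"a": 0, "b": 2, "c": 3}
M = 4
```

      Each encode step is exactly inverted by the corresponding decode step, which is what makes the vectorized (AVX) variants possible: several independent states can be interleaved.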