Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    1. PDP System Run Coordination
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      EOS handling / START-STOP-START:

      • Tests with START / STOP / START planned with the FLP team today at 2pm (goal is to identify detector workflow related issues and communicate them to the detector teams)
        • Tests by Pippo yesterday; we have not followed up on them yet.

      GPU ROCm Issues:

      • Create new minimal reproducer for compile error when we enable LOG(...) functionality in the HIP code. Check whether this is a bug in our code or in ROCm. Lubos will work on this.
      • Another compiler problem with template treatment found by Ruben. Have a workaround for now. Need to create a minimal reproducer and file a bug report.

      Full system test:

      • Multi-threaded pipeline still not working in FST / sync processing; it works only in the standalone benchmark.
        • Should be highest priority in DPL development, once slow topology generation, EoS timeout, START-STOP-START have been fixed.
        • Possible workaround: could implement a special topology with 2 devices in a chain (same number of pipeline levels, so we have basically 2 devices per GPU). The first would extract all relative offsets of the TPC data from the SHM buffers and send them via an asynchronous non-DPL channel to the actual processing device. In that way, processing of timeframe n+1 could start before the result for timeframe n is returned to DPL. When DPL starts the doProcessing function, the GPU would already be processing that TF and would just return it when it is ready. Development time should be no more than 3 exclusive days (see the sketch below).
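        A minimal standalone sketch of the pipelining idea (plain std::thread code, not actual DPL / FairMQ devices): an extractor stage forwards the offsets for the next time frame over a simple queue that stands in for the asynchronous non-DPL channel, so the processing stage never waits for the framework. extractOffsets and processOnGPU are hypothetical placeholders for illustration only.

          #include <condition_variable>
          #include <cstddef>
          #include <cstdio>
          #include <mutex>
          #include <queue>
          #include <thread>
          #include <vector>

          struct TimeFrameMeta {
            int id;
            std::vector<std::size_t> offsets; // relative offsets of the TPC data inside the SHM buffer
          };

          std::queue<TimeFrameMeta> channel; // stands in for the asynchronous non-DPL channel
          std::mutex mtx;
          std::condition_variable cv;
          bool done = false;

          TimeFrameMeta extractOffsets(int tf) { // placeholder: device 1 scans the SHM segment
            return {tf, {0, 1024, 2048}};
          }

          void processOnGPU(const TimeFrameMeta& meta) { // placeholder: device 2 runs the GPU reconstruction
            std::printf("processing TF %d with %zu offsets\n", meta.id, meta.offsets.size());
          }

          int main() {
            // Device 1: extract the offsets and forward them immediately, independent of
            // whether device 2 has already returned earlier time frames to the framework.
            std::thread extractor([] {
              for (int tf = 0; tf < 5; ++tf) {
                TimeFrameMeta meta = extractOffsets(tf);
                {
                  std::lock_guard<std::mutex> lock(mtx);
                  channel.push(std::move(meta));
                }
                cv.notify_one();
              }
              {
                std::lock_guard<std::mutex> lock(mtx);
                done = true;
              }
              cv.notify_one();
            });

            // Device 2: start working on TF n+1 as soon as its offsets arrive; handing the
            // result for TF n back to the framework happens independently.
            std::thread processor([] {
              while (true) {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [] { return !channel.empty() || done; });
                if (channel.empty()) {
                  break; // no more time frames
                }
                TimeFrameMeta meta = std::move(channel.front());
                channel.pop();
                lock.unlock();
                processOnGPU(meta);
              }
            });

            extractor.join();
            processor.join();
            return 0;
          }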

      Global calibration status:

      • Problem with TPC SCD calib performance confirmed by Ole, but not yet understood. Work in progress. The interplay between branches of the calibration chain is also under investigation.
      • TPC IDC calibration:
        • Long debug session yesterday with Giulio, Chiara, Robert, and David; in the end we managed to ship the IDCs directly from the FLPs to the calib node, but the workflow was too slow and we had backpressure, most likely because the build was done with -O0 -g for debugging. An optimized build is available now; we have to retry at the next opportunity.

      • Failures reading CCDB objects at high rate.

      Issues currently lacking manpower, waiting for a volunteer:

      • For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
      • Redo / Improve the parameter range scan for tuning GPU parameters. In particular on the AMD GPUs, which seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets (see the sketch below).
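      A minimal sketch of what such a scan could look like, assuming a runBenchmark helper that wraps the standalone benchmark; the parameter names and value ranges are purely illustrative, not the real tracking options.

        #include <cstdio>
        #include <limits>
        #include <vector>

        struct Params {
          int nThreads;
          int nBlocks;
        };

        // Placeholder: would run the standalone benchmark with the given parameters on the
        // given time frames and return the average processing time per TF.
        double runBenchmark(const Params& p, const std::vector<int>& timeFrames) {
          double t = 0.0;
          for (int tf : timeFrames) {
            t += 100.0 / (p.nThreads * p.nBlocks) + 0.01 * tf; // dummy cost model
          }
          return t / timeFrames.size();
        }

        int main() {
          // Training and test TFs must be disjoint and of realistic (full) size, since the
          // AMD GPUs are particularly sensitive to the memory sizes involved.
          const std::vector<int> trainTFs = {0, 1, 2, 3};
          const std::vector<int> testTFs = {4, 5};
          const std::vector<int> threadOptions = {64, 128, 256};
          const std::vector<int> blockOptions = {60, 120, 240};

          Params best{};
          double bestTime = std::numeric_limits<double>::max();
          for (int threads : threadOptions) {
            for (int blocks : blockOptions) {
              const Params p{threads, blocks};
              const double t = runBenchmark(p, trainTFs);
              if (t < bestTime) {
                bestTime = t;
                best = p;
              }
            }
          }

          // Evaluate the chosen parameters on the held-out test set to avoid overfitting
          // to the time frames used for tuning.
          std::printf("best: threads=%d blocks=%d train=%.3f test=%.3f\n",
                      best.nThreads, best.nBlocks, bestTime, runBenchmark(best, testTFs));
          return 0;
        }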

      Speeding up startup of workflow on the EPN:

      • Next step: Measure startup times in online partition controlled by AliECS, and compare to standalone measurement of 18 seconds for process startup.

      EPN Scheduler

      • GPU grid jobs are running on the EPNs in a stable way now.
      • Need a procedure / tool to move nodes quickly between online / async partition. EPN working on this. Currently most EPNs are still usually in online, and we have to ask to get some in async. Should arrive at a state where all EPNs that are not needed are in async by default.
      • Discussion on open file handles / inodes and on memory usage on Monday. Minutes here: https://indico.cern.ch/event/1210850/

      Important framework features to follow up:

      • Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to show the pure payload (see the sketch after this list): low priority.
      • Backpressure reporting when there is only 1 input channel: no progress.
      • Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
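      A tiny sketch of the proposed payload-only rate; the header size and message layout are assumptions for illustration, not the actual DPL metric code.

        #include <cstddef>
        #include <cstdio>
        #include <vector>

        struct MessagePart {
          std::size_t headerBytes;  // O2 header stack (illustrative size)
          std::size_t payloadBytes; // actual data
        };

        int main() {
          // Illustrative message parts accumulated during one metric publishing interval.
          const std::vector<MessagePart> parts = {{80, 4096}, {80, 1 << 20}, {80, 512}};
          std::size_t totalBytes = 0, payloadBytes = 0;
          for (const auto& p : parts) {
            totalBytes += p.headerBytes + p.payloadBytes;
            payloadBytes += p.payloadBytes;
          }
          const double interval = 1.0; // seconds between metric updates
          std::printf("raw rate: %.0f B/s, payload rate: %.0f B/s\n",
                      totalBytes / interval, payloadBytes / interval);
          return 0;
        }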

      Minor open points:

      • https://alice.its.cern.ch/jira/browse/O2-1900 : FIX in PR, but has side effects which must also be fixed.
      • https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
      • https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
      • https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
      • https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
      • https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
      • DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
      • Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it is only at the termination, and the fix of the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
      • Support in DPL GUI to send individual START and STOP commands.

      Workflow repository:

       

      O2 versions at P2:

      • O2 on EPNs with the CTP workflow is running now; the FLP CTP workflow will only be deployed in a few weeks.
      • String translation TRG --> CTP implemented in gen_topo and GRP creation, since ECS uses TRG while O2 uses CTP. Works for topology generation, but sending the string to GRP creation was forgotten. Will come with the next AliECS update.
      • CTP now enabled by default.

      Other SRC items:

       

      QC / Monitoring updates:

      • Problem with EPN2EOS rate in dataflow dashboard fixed.
      • TPC has opened a first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side, and the plan is to extend this to all detectors and to also include trending for raw data sizes.
      • QC log file writing on the EPN re-enabled and working, now with proper throttling in order not to fill the disks.

      TPC CTF Skimming:

      • Discussed approach with Ruben. Work ongoing.
    2. Software for Hardware Accelerators
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      General:

      • Problem on MI100 / MI210 not solved by new ROCm beta version. AMD can reproduce the problem and is investigating.
      • Locally tested OpenCL compilation with Clang 14, bumping -cl-std from clc++ (OpenCL 2.0) to CLC++2021 (OpenCL 3.0) and using the clang-internal SPIR-V backend. Arrow bump to 8.0 done, which was a prerequisite.

      ITS GPU Tracking and Vertexing:

      • Matteo will spend 1 week working on multi-threading of ITS vertexing, then go back to GPU ITS tracking.

      TPC GPU Processing

      • Investigation of “random crashes” in TPC tracking: had a look at all GPU memory fault error messages in the last 60 days; we have to distinguish 2 cases:
        • Correlated crashes: gpu-reconstruction crashes on many nodes at more or less the same time. All of the cases are most certainly caused by corrupt raw data. In each case there are plenty of warnings for corrupt TPC meta data at the same time.
          • Note that:
            • In order to recover half-broken time frames, TFs with broken meta info are also processed by the GPU if possible.
            • Not all cases of broken meta information are detected ahead of time, i.e. meta-data corruption can lead to a GPU crash without a prior warning message.
          • Several of these cases were also TPC standalone tests and not global runs.
        • Single crashes, where just one gpu-reconstruction process segfaults (on a longer timescale of at least one hour). Much more difficult to investigate:
          • Often accompanied by a warning for corrupt meta data by that process before, so same as above but only a single TF is corrupted.
          • Can be caused by bad GPU memory if it happens on the same GPU many times, particularly when it happens in the FST where the TF is known to be good. One has to collect statistics on which GPUs fail often and replace them. But this has become pretty rare by now (e.g. epn199).
            • Started to create some statistics and am discussing with Dirk from EPN (see the sketch at the end of this section).
          • There are several cases of only a single crash in a run, on a GPU which never crashed again. Not clear what is the cause.
            • Unlikely that the GPU is broken.
            • Likely that it is corrupt TPC metadata, since the checks that show error messages cannot detect all kinds of corruption.
            • Could of course also be a race condition or another bug in the code. But we cannot reproduce the problem: we store only 1 permille of the raw data (and none for the TPC standalone test), and we don’t have the raw files which crashed.

      Unfortunately, we cannot do anything more right now. We should fix the metadata corruption first and replace GPUs that fail regularly. Only afterwards would I try to check for race conditions / software bugs.
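      A minimal sketch of the per-GPU failure statistics mentioned above; the node names and the threshold are purely illustrative, and the real input would come from the collected error logs.

        #include <cstdio>
        #include <map>
        #include <string>
        #include <vector>

        int main() {
          // Illustrative crash records; in practice one entry per gpu-reconstruction
          // segfault extracted from the logs of the last ~60 days.
          const std::vector<std::string> crashNodes = {"epn199", "epn042", "epn199", "epn199", "epn113"};

          std::map<std::string, int> crashesPerNode;
          for (const auto& node : crashNodes) {
            ++crashesPerNode[node];
          }

          const int replacementThreshold = 3; // illustrative cut for "fails often"
          for (const auto& [node, count] : crashesPerNode) {
            std::printf("%s: %d crash(es)%s\n", node.c_str(), count,
                        count >= replacementThreshold ? "  <-- candidate for GPU replacement" : "");
          }
          return 0;
        }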

      TRD Tracking

       

      ANS Encoding

      • Waiting for PR with AVX-accelerated ANS encoding