Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC
11:00 → 11:20    PDP System Run Coordination (20m). Speakers: David Rohr (CERN), Ole Schmidt (CERN)
Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
EOS handling / START-STOP-START:
- Tests with START / STOP / START planned with the FLP team today at 2pm (goal is to identify detector workflow related issues and communicate them to the detector teams)
- https://github.com/AliceO2Group/AliceO2/pull/9895 still not merged due to a conflict.
Problem at EOR and with calib workflows:
- https://alice.its.cern.ch/jira/browse/O2-3315
- Fixed part of this problem; there were actually 2 separate issues:
- The calib workflow scripts were lacking the setting of the completion policy of the output-proxy. (Not sure what the default policy is, but sporadic and TF proxies certainly need different policies, so at least one of them was configured incorrectly.)
- Input for sporadic proxies is not guaranteed to follow the oldestPossibleTimeslice paradigm, since it is only sent sporadically. Added an option to not drop data that is older than the assumed oldestPossibleTimeslice (see the sketch after this list).
- Problem at EOR probably still exists.
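As a rough illustration of the second issue (a hypothetical sketch, not the actual DPL code; all names are made up), the relaxed dropping logic for sporadic inputs could look like this:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical message descriptor: data older than the assumed
// oldestPossibleTimeslice is normally discarded; sporadic inputs (e.g.
// calibration proxies) do not follow that paradigm, so an option is needed
// to keep such messages instead of dropping them.
struct InputMessage {
  uint64_t timeslice;      // timeslice the message belongs to
  bool fromSporadicProxy;  // true for sporadic (non TF-aligned) inputs
};

bool mayDrop(const InputMessage& msg, uint64_t oldestPossibleTimeslice,
             bool keepOldSporadicData /* the new option */)
{
  if (msg.timeslice >= oldestPossibleTimeslice) {
    return false; // still within the window the framework is waiting for
  }
  // Older than the assumed oldest timeslice: normally droppable, unless the
  // message comes from a sporadic proxy and the new option is enabled.
  return !(msg.fromSporadicProxy && keepOldSporadicData);
}

int main()
{
  InputMessage calibMsg{42, true};
  std::printf("drop without option: %d, with option: %d\n",
              mayDrop(calibMsg, 100, false), mayDrop(calibMsg, 100, true));
  return 0;
}
```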
Full system test:
- Multi-threaded pipeline still only working in the standalone benchmark, not yet in the FST / sync processing.
- Should be highest priority in DPL development, once slow topology generation, EoS timeout, START-STOP-START have been fixed.
- Possible workaround: implement a special topology with 2 devices in a chain (same number of pipeline levels, so we basically have 2 devices per GPU). The first would extract all relative offsets of TPC data from the SHM buffers and send them via an asynchronous non-DPL channel to the actual processing device. In that way, processing of timeframe n+1 could start before the result for timeframe n has been returned to DPL. When DPL starts the doProcessing function, the GPU would already be processing that TF and would just return it when it is ready. Development time should be no more than 3 exclusive days. (A sketch of the idea follows below.)
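A minimal sketch of this workaround (generic C++, not the actual DPL / FairMQ interfaces; the channel and descriptor types are made up for illustration): a lightweight front device extracts the TPC data offsets and pushes them to the processing device through a side channel, so the GPU stage can already work on TF n+1 while the result for TF n is still pending.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical descriptor the first device would extract from the SHM buffers.
struct TfDescriptor {
  int tfId;
  std::vector<std::size_t> tpcDataOffsets; // relative offsets of TPC data in SHM
};

// Small thread-safe queue standing in for the asynchronous non-DPL channel.
template <typename T>
class AsyncChannel {
 public:
  void send(T v) {
    std::lock_guard<std::mutex> lk(m_);
    q_.push(std::move(v));
    cv_.notify_one();
  }
  T receive() {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [this] { return !q_.empty(); });
    T v = std::move(q_.front());
    q_.pop();
    return v;
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::queue<T> q_;
};

int main() {
  AsyncChannel<TfDescriptor> channel;

  // Stage 2: the actual GPU processing device. It pulls descriptors from the
  // side channel and can start TF n+1 as soon as it arrives, independently of
  // when the result for TF n is handed back to the framework.
  std::thread gpuDevice([&] {
    for (int i = 0; i < 3; ++i) {
      TfDescriptor tf = channel.receive();
      std::printf("processing TF %d with %zu TPC offsets\n", tf.tfId, tf.tpcDataOffsets.size());
      // ... launch GPU processing, return the result when it is ready ...
    }
  });

  // Stage 1: the front device extracting offsets and forwarding them.
  for (int tfId = 0; tfId < 3; ++tfId) {
    channel.send(TfDescriptor{tfId, {0, 4096, 8192}});
  }
  gpuDevice.join();
  return 0;
}
```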
Global calibration status:
- Problem with TPC SCD calib performance confirmed by Ole.
- Can run online without compression, and at low rate even with compression. Ole still checking other compression algorithms.
- TPC IDC calibration:
- Cleanup for IDC done, should be fully deployed and stable now.
- Problem with SAC was due to a typo in the subspec; fixed AFAIK.
- TPC time gain calib: implemented downscaling (see the sketch after this list); tested in the technical run that the workflow comes up, but could not test that it works without tracks.
- FIT calib integrated into global workflow.
- PHOS and ZDC calibration write ROOT files to the local folder; this should be disabled.
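Purely as an illustration of the downscaling mentioned above for the TPC time gain calibration (and later for the QC tasks), here is a minimal sketch with hypothetical names, not the actual O2 implementation: downscaling simply processes only every N-th time frame.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical sketch: keep only every N-th time frame for a calibration / QC
// task, to reduce its CPU cost at high rates.
struct Downsampler {
  uint32_t factor;     // process 1 out of 'factor' TFs
  uint64_t counter = 0;

  bool accept() { return factor <= 1 || (counter++ % factor) == 0; }
};

int main() {
  Downsampler ds{10}; // assume a downscaling factor of 10
  int processed = 0;
  for (int tf = 0; tf < 100; ++tf) {
    if (ds.accept()) {
      ++processed; // ... run the expensive calibration / QC code here ...
    }
  }
  std::printf("processed %d of 100 TFs\n", processed);
  return 0;
}
```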
Failures reading CCDB objects at high rate:
- Costin will follow up on some proposed improvements to have a proper fix for the future. See https://alice.its.cern.ch/jira/browse/O2-3097?filter=-2
Issues currently lacking manpower, waiting for a volunteer:
- For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
- Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
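A rough sketch of what such a parameter range scan could look like (generic C++ with hypothetical parameter and benchmark names; the real scan would run the GPU reconstruction): the scan picks the best value on training time frames of realistic size and then validates it on a separate test set.

```cpp
#include <cstdio>
#include <limits>
#include <vector>

// Hypothetical stand-in for running the GPU benchmark on one set of time
// frames with a given parameter value and returning the mean processing time.
double benchmark(const std::vector<int>& timeFrames, int paramValue) {
  // ... run the reconstruction with 'paramValue' on 'timeFrames' and time it ...
  return 1000.0 / (1 + paramValue % 7) + timeFrames.size() * 0.0; // dummy model
}

int main() {
  // Time frames of realistic size, split into training and test sets.
  std::vector<int> trainingTFs = {0, 1, 2, 3};
  std::vector<int> testTFs = {4, 5};

  int bestParam = -1;
  double bestTime = std::numeric_limits<double>::max();
  for (int param = 16; param <= 256; param *= 2) { // hypothetical scan range
    double t = benchmark(trainingTFs, param);
    if (t < bestTime) {
      bestTime = t;
      bestParam = param;
    }
  }
  // Validate the chosen parameter on the held-out test set.
  std::printf("best param %d: train %.1f ms, test %.1f ms\n",
              bestParam, bestTime, benchmark(testTFs, bestParam));
  return 0;
}
```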
Speeding up startup of workflow on the EPN:
- Next step: Measure startup times in online partition controlled by AliECS, and compare to standalone measurement of 18 seconds for process startup.
EPN Scheduler:
- Problem with no jobs running due to a JDL matching problem after resubmission was a vobox bug; fixed by Max.
- Need a procedure / tool to move nodes quickly between the online and async partitions. The EPN team is working on this. Currently most EPNs are usually in online, and we have to ask to get some in async. We should arrive at a state where all EPNs that are not needed online are in async by default.
- Discussed with Latchezar / Chiara that we will run the EPNs in preemptive mode, i.e. no draining, we just kill the grid jobs. With an agent runtime of 24 hours, we'd need to drain 24h in advance, which would on average leave the node idle for 12h. On the EPN grid, async reco job time is O(1h), so we could usually run 12 additional jobs for each one that will fail due to being killed.
- We'll do the same for nodes we need for software tests.
Important framework features to follow up:
- Grafana metrics: might want to introduce additional rate metrics that subtract the header overhead to show the pure payload rate; low priority.
- Backpressure reporting when there is only 1 input channel: no progress.
- Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
- Performance problem in the raw proxy solved together with Alexey. Now using 2 I/O threads, at the expense of additional CPU load. Still not clear why a single thread cannot do more than 1.7 GB/s on the EPN. Should be investigated on the FMQ side.
Minor open points:
- https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but it has side effects which must also be fixed.
- https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
- https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
- https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
- https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
- https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
- DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
- Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it happens only at termination, and fixing the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
- Support in DPL GUI to send individual START and STOP commands.
- Assigned JIRA tickets that were left for Matthias to Giulio (sorry :))
Non-critical QC tasks:
- QC tasks were backpressuring when failing, since dropping data on the route to QC tasks was not yet enabled. Changed with a software update; remains to be seen whether it works.
- The problem I mentioned last time with non-critical QC tasks and the DPL CCDB fetcher is real. It will need some extra work to solve; otherwise non-critical QC tasks will stall the DPL chain when they fail.
Reconstruction performance:
- Online with Pb-Pb:
- RC started to do SYNTHETIC Pb-Pb performance tests again. Found some TPC QC tasks that were too slow, implemented downscaling.
- Async performance:
- Fixes for the AOD performance issue and the TPC QC downscaling were implemented; have to retest in the 1 NUMA domain setup.
- Should also test using 50 kHz Pb-Pb MC data.
- Opened a JIRA ticket for EPN to follow up the interface to change SHM memory sizes when no run is ongoing (which was requested 1 year ago). Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
Topology generation:
- If detectors had bad QC JSONs in Consul, they would be downloaded and merged, the failure was only reported when o2-qc was invoked, and it was not even clear which detector had caused the problem.
- Added a JSON syntax check for each file after downloading (by design this doesn't help if the syntax is correct but the content is not); see the sketch after this list.
- Also checking the merged QC JSON.
- Added additional debug output to make it clearer where the topology generation failed.
- DPL does not correctly return -1 if o2-qc fails during topology generation. Added a workaround in the topology generation, checking stdout / stderr for error messages. https://alice.its.cern.ch/jira/browse/O2-3349
- Should change the dpl-workflow script to fail if any process in the dpl pipe (workflow | workflow | ...) has a non-zero exit code (e.g. via bash's set -o pipefail).
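As an illustration of the per-file syntax check (a sketch only; the actual topology generation may use a different JSON parser, and the use of nlohmann/json here is just an assumption):

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

#include <nlohmann/json.hpp> // assumption: nlohmann/json is available

// Return true if 'path' contains syntactically valid JSON. By design this
// cannot catch files that are valid JSON but semantically wrong QC configs.
bool isValidJson(const std::string& path) {
  std::ifstream in(path);
  if (!in) {
    std::cerr << "cannot open " << path << "\n";
    return false;
  }
  std::stringstream buf;
  buf << in.rdbuf();
  return nlohmann::json::accept(buf.str());
}

int main(int argc, char** argv) {
  bool ok = true;
  for (int i = 1; i < argc; ++i) {
    if (!isValidJson(argv[i])) {
      std::cerr << "syntax error in QC JSON: " << argv[i] << "\n";
      ok = false; // report the offending detector file explicitly
    }
  }
  return ok ? 0 : 1;
}
```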
O2 versions at P2:
- Several O2 updates done by Ole in the last day. Now cherry-picking until the end of Pb-Pb to guarantee stability.
- Need to deploy new version with calib fixes.
- Currently a problem with RPM generation: it always takes a very long time until RPMs are available. Timo is investigating.
Other SRC items:
QC / Monitoring updates:
- TPC has opened the first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side, and the plan is to extend this to all detectors and to also include trending for raw data sizes.
- New QC / Infologger versions deployed on EPN. QC log writing enabled again. Need to see whether all issues are fixed now.
Testing Pb-Pb workflow at P2:
- Creating SYNTHETIC data set with low IR Pb-Pb data, to test the Pb-Pb workflow we want to run (which will exceptionally include MUON and DEDX, and full ITS processing). To be deployed today.
- Tests with START / STOP / START planned with the FLP team today at 2pm (goal is to identify detector workflow related issues and communicate them to the detector teams)
11:20 → 11:40    Software for Hardware Accelerators (20m). Speakers: David Rohr (CERN), Ole Schmidt (CERN)
Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
General:
- AMD identified the problem on the MI210: on the MI210 we have to disable the workaround (which increases the resource consumption) that they had asked us to put in place for the MI50.
- Locally tested OpenCL compilation with Clang 14, bumping -cl-std from clc++ (OpenCL 2.0) to CLC++2021 (OpenCL 3.0) and using the clang-internal SPIR-V backend. Arrow bump to 8.0 done, which was a prerequisite.
ROCm compilation issues:
- Need to create a new minimal reproducer for the compile error that appears when we enable the LOG(...) functionality in the HIP code, and check whether this is a bug in our code or in ROCm. Lubos will work on this.
- Matteo implemented a workaround for the LOG(...) problem, so we can now at least use the LOG macro in the ROCm code. But the internal compiler error is not yet fixed, so it may come back.
- Another compiler problem with template treatment found by Ruben. Have a workaround for now. Need to create a minimal reproducer and file a bug report.
ITS GPU Tracking and Vertexing:
- Matteo will spend 1 week working on multi-threading of ITS vertexing, then go back to GPU ITS tracking.
TPC GPU Processing:
- Felix fixed a problem in the clusterization which gave different results between the CPU and GPU versions.
- More investigation of random GPU crashes:
- Besides the crashes from broken GPUs and from corrupt TPC raw data, there is definitely another type of crash that affects all GPUs and is not triggered by corrupt raw data.
- Quite rare, but it happens more often if the node is under heavy load (e.g. it happens more often in a pp run with 100 EPNs than in the same run with 200 EPNs).
- Happens both with raw and with MC pp data, but seems to happen more often in real data than in MC.
- Have never seen it happening in Pb-Pb MC data despite the largest statistics. Either this is coincidence, or it can only happen when certain patterns / occupancies are present in the data.
- Identified one time frame which crashed 6 times in a few million runs, on 6 different EPNs which were otherwise stable.
- Need to do some special runs with extensive debug output to investigate this further.
- TPC CTF Skimming:
- Implemented in the TPC entropy encoder; fully working but not yet the final version. Can already be used; future improvements will remove somewhat more clusters.
- Not yet applying the eta-check on unattached clusters, but storing all compatible drift times.
- Need to take into account TPC distortions for the z / eta check, either with some margin or by assuming some average distortion corrections (see the sketch after this list).
- Problem with bogus tracks during CTF skimming was due to an incorrectly configured B field and TF length. Will have a fix today. We should also add the B field strength and TF length to the TPC CTF metadata, to check for consistency during decoding.
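To illustrate the margin approach for the z / eta check on unattached clusters (a sketch with hypothetical names and numbers, not the actual O2 code): a cluster is kept if the z position reconstructed from its drift time is still compatible with the allowed eta range once an extra margin for the TPC distortions is allowed for.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Hypothetical sketch of an eta / z compatibility check for unattached TPC
// clusters during CTF skimming. Space-charge distortions can shift the
// reconstructed z, so a margin is subtracted before the check, which makes
// the cut conservative (keeps more clusters).
bool keepUnattachedCluster(float z, float radius, float etaMax, float distortionMarginCm) {
  // Reduce |z| by the assumed maximum distortion before computing eta, so a
  // cluster is only rejected if it is incompatible even after the shift.
  float zEff = std::max(0.f, std::fabs(z) - distortionMarginCm);
  float eta = std::asinh(zEff / radius); // eta of a straight line from the origin
  return eta <= etaMax;
}

int main() {
  // Dummy numbers purely for illustration.
  std::printf("keep: %d\n", keepUnattachedCluster(240.f, 100.f, 1.5f, 5.f));
  return 0;
}
```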
TRD Tracking:
ANS Encoding:
- Waiting for the PR with AVX-accelerated ANS encoding.