Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 11:00-11:20
      PDP System Run Coordination (20m)
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: critical / news during the meeting: green; news from this week: blue; news from last week: purple; no news: black

      Problem with EoS and START/STOP/START:

      Full system test:

      • Multi-threaded pipeline still not working in the FST / sync processing; it works only in the standalone benchmark.
        • Should be the highest priority in DPL development once slow topology generation, the EoS timeout, and START-STOP-START have been fixed.
        • Possible workaround: implement a special topology with 2 devices in a chain (same number of pipeline levels, so essentially 2 devices per GPU). The first device would extract all relative offsets of the TPC data from the SHM buffers and send them via an asynchronous non-DPL channel to the actual processing device. That way, processing of time frame n+1 could start before the result for time frame n is returned to DPL. When DPL calls the doProcessing function, the GPU would already be processing that TF and would just return it when ready. Development time should be no more than 3 exclusive days. A sketch of the pattern follows below.
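
      A minimal standalone sketch of that workaround's shape, with everything hypothetical (the queue stands in for the asynchronous non-DPL channel, the printf for the GPU call; no DPL/FairMQ API is used): stage 1 forwards per-TF descriptors immediately, so stage 2 can already process TF n+1 while TF n's result is still being handed back.

```cpp
// Illustrative two-stage pipeline: offset extraction feeds processing
// through a queue, decoupling the two stages.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

struct TFDescriptor {  // placeholder for the relative SHM offsets of one TF
  int tfId;
};

template <typename T>
class BlockingQueue {  // minimal thread-safe FIFO; stands in for the channel
 public:
  void push(T v) {
    { std::lock_guard<std::mutex> l(m); q.push(std::move(v)); }
    cv.notify_one();
  }
  std::optional<T> pop() {  // returns std::nullopt once closed and drained
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [this] { return !q.empty() || closed; });
    if (q.empty()) return std::nullopt;
    T v = std::move(q.front());
    q.pop();
    return v;
  }
  void close() {
    { std::lock_guard<std::mutex> l(m); closed = true; }
    cv.notify_all();
  }
 private:
  std::mutex m;
  std::condition_variable cv;
  std::queue<T> q;
  bool closed = false;
};

int main() {
  BlockingQueue<TFDescriptor> channel;

  // Stage 2: the "actual processing device"; the GPU work is a placeholder.
  std::thread processor([&] {
    while (auto tf = channel.pop()) {
      std::printf("processing TF %d\n", tf->tfId);
    }
  });

  // Stage 1: extract descriptors and forward them without waiting for the
  // previous TF to finish, so the two stages overlap.
  for (int tfId = 0; tfId < 5; ++tfId) {
    channel.push(TFDescriptor{tfId});
  }
  channel.close();
  processor.join();
}
```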

      Global calibration status:

      • Problem with TPC SCD calib performance confirmed by Ole.
        • Much faster with zstd; will switch to zstd. (A generic zstd sketch follows after this list.)
      • TPC IDC / SAC calibration:
        • More fixes applied, but the combination of the SAC and IDC workflows is still not working; under investigation.
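
      For reference, a minimal round trip with the zstd C API (buffer contents and compression level are arbitrary; whether the calibration objects would go through exactly this plain API, rather than e.g. ROOT's built-in zstd support, is an assumption):

```cpp
// Generic zstd compress/decompress round trip; illustrative only.
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>
#include <zstd.h>

int main() {
  std::string input(1 << 20, 'x');  // stand-in for a serialized calib object

  // Compress: ZSTD_compressBound() gives the worst-case output size.
  std::vector<char> compressed(ZSTD_compressBound(input.size()));
  size_t csize = ZSTD_compress(compressed.data(), compressed.size(),
                               input.data(), input.size(),
                               3);  // level 3: common speed/ratio tradeoff
  if (ZSTD_isError(csize)) {
    std::fprintf(stderr, "compression failed: %s\n", ZSTD_getErrorName(csize));
    return 1;
  }

  // Decompress and verify the round trip.
  std::vector<char> restored(input.size());
  size_t dsize = ZSTD_decompress(restored.data(), restored.size(),
                                 compressed.data(), csize);
  if (ZSTD_isError(dsize) || dsize != input.size() ||
      std::memcmp(restored.data(), input.data(), dsize) != 0) {
    std::fprintf(stderr, "round trip failed\n");
    return 1;
  }
  std::printf("compressed %zu -> %zu bytes\n", input.size(), csize);
}
```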

      Failures reading CCDB objects at high rate.

      Issues currently lacking manpower, waiting for a volunteer:

      • For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
      • Redo / improve the parameter range scan for tuning the GPU parameters. In particular on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate the training and test data sets. (A sketch of such a scan follows below.)
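
      A sketch of such a scan, with all names hypothetical (benchmarkTimeFrame() stands in for running the real GPU benchmark with the given parameters on one time frame, and the parameter grid is invented); the point is only the separation of training and test TFs:

```cpp
// Hypothetical parameter scan that tunes on training TFs and reports the
// chosen configuration on held-out test TFs, to avoid overfitting the scan.
#include <cstdio>
#include <limits>
#include <vector>

struct GPUParams {
  int blockSize;
  int nStreams;
};

// Placeholder: would run the tracking on one TF and return the runtime (s).
double benchmarkTimeFrame(const GPUParams& p, int tfId) {
  return 3.5 - 0.001 * (p.blockSize % 97) - 0.01 * p.nStreams + 0.001 * tfId;
}

double meanTimePerTF(const GPUParams& p, const std::vector<int>& tfs) {
  double sum = 0;
  for (int tf : tfs) sum += benchmarkTimeFrame(p, tf);
  return sum / tfs.size();
}

int main() {
  // Disjoint TF sets of the representative size: tune on one, evaluate on
  // the other, so the winning parameters are not fitted to the scan TFs.
  const std::vector<int> trainTFs = {0, 1, 2, 3};
  const std::vector<int> testTFs = {4, 5, 6, 7};

  GPUParams best{};
  double bestTime = std::numeric_limits<double>::max();
  for (int blockSize : {64, 128, 256, 512}) {
    for (int nStreams : {1, 2, 4, 8}) {
      const GPUParams p{blockSize, nStreams};
      const double t = meanTimePerTF(p, trainTFs);
      if (t < bestTime) {
        bestTime = t;
        best = p;
      }
    }
  }
  std::printf("best on train: blockSize=%d nStreams=%d -> %.3f s/TF\n",
              best.blockSize, best.nStreams, bestTime);
  std::printf("held-out test: %.3f s/TF\n", meanTimePerTF(best, testTFs));
}
```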

      Speeding up startup of workflow on the EPN:

      • Measured startup times of processes in a SYNTHETIC Pb-Pb run with all detectors and the maximum number of EPNs:
        • 186 seconds on the FLPs (serialized with the EPN startup, which happens only after the EPNs are done).
        • 99 seconds on the EPNs.
        • 12 seconds (+/-1) for the PDP processes on the EPN if started by the DPL driver.

      EPN Scheduler:

      • Need a procedure / tool to move nodes quickly between the online and async partitions. EPN is working on this. Currently most EPNs are usually in online, and we have to ask to get some in async. We should arrive at a state where all EPNs that are not needed are in async by default.
      • Problem with GPU GRID jobs being killed due to exceeding the 60 GB memory limit. Not clear what happens. Unable to reproduce the issue by running the same job manually 8 times on the same EPN, using the same async slurm queue and the same 60 GB per slurm job memory limit.

      Important framework features to follow up:

      • Grafana metrics: might want to introduce additional rate metrics that subtract the header overhead to show the pure payload rate. Low priority. (A sketch of the computation follows after this list.)
      • Backpressure reporting when there is only 1 input channel: no progress.
      • Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
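
      As an illustration of the payload-only rate (the per-message header size here is a made-up number, not the real O2 header size):

```cpp
// Illustrative payload-rate computation: subtract a fixed per-message
// header overhead from the raw byte count before forming the rate.
#include <cstdio>
#include <vector>

int main() {
  const double intervalSec = 10.0;   // metric reporting interval
  const size_t headerBytes = 80;     // assumed per-message header size
  const std::vector<size_t> messageSizes = {100000, 250000, 80, 80, 52000};

  size_t totalBytes = 0;
  for (size_t s : messageSizes) totalBytes += s;
  const size_t payloadBytes = totalBytes - headerBytes * messageSizes.size();

  std::printf("total rate:   %.1f B/s\n", totalBytes / intervalSec);
  std::printf("payload rate: %.1f B/s\n", payloadBytes / intervalSec);
}
```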

      Minor open points:

      • https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but it has side effects which must also be fixed.
      • https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
      • https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
      • https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
      • https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
      • https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
      • DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
      • Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination if the wrong topology is running. Not critical, since it happens only at termination, and fixing the topology avoids it in any case. But we should still understand and fix the crash itself; a reproducer is available.
      • Support in DPL GUI to send individual START and STOP commands.

      Non-critical QC tasks:

      • The problem I mentioned last time with non-critical QC tasks and the DPL CCDB fetcher is real. It will need some extra work to solve; otherwise non-critical QC tasks will stall the DPL chain when they fail.

      Reconstruction performance:

      • Async performance:
        • Fixes for TPC QC and AOD are working; testing the 1-NUMA config.
        • Unfortunately spotted 2 new problems:
          • The async workflow gets stuck when I use the tuned multiplicities. Giulio and I are checking: https://alice.its.cern.ch/jira/browse/O2-3399
          • I can work around that by disabling AOD writing (not clear why). But then we get oscillations with TF rate limiting, and removing bottlenecks only amplifies the oscillations. Still checking, but I do not manage to get 100% CPU load.
        • Performance status so far: 3.45 s per TF in the 4-GPU setup, compared to 3.55 s in the 4 x 1-GPU setup, on LHC22f data.
          • Average EPN CPU load is at ~40%; the workflow could use 50%, so there is 25% margin for improvement.
          • AOD production, which was disabled, would use ~3%, so 43% of the 50% ceiling is accounted for and the net margin is 7/43 = 16%, i.e. the speed of light is 3.45 s / 1.16 = 2.97 s per TF.
        • Doing Pb-Pb benchmarks now (using plain dpl-workflow.sh without extra env variables, since we use MC data)
      • Opened a JIRA ticket for EPN to follow up on the interface to change the SHM memory sizes when no run is ongoing (requested 1 year ago). Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250

      Topology generation:

      • Should change the dpl-workflow script to fail if any process in the DPL pipe (workflow | workflow | ...) exits with a non-zero code, e.g. by enabling bash's set -o pipefail (or checking the PIPESTATUS array) so the pipeline's exit status reflects the failing stage.

      TPC DLBZS / SYNTHETIC runs:

      • Change to the DLBZS format requested by TPC implemented in the GPU decoder and simulation; updated the SYNTHETIC data sets at P2.
      • Checked some DLBZS data recorded by TPC with the encoder. Found some issues (incorrect link ID) and reported them to TPC; fixed in new firmware. Waiting for new data to check.

       

      QC / Monitoring / InfoLogger updates:

      • TPC has opened a first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side, and the plan is to extend this to all detectors and to also include trending for raw data sizes.
      • No issues with QC log file writing so far, seems we have fixed all problems.
      • Commented on https://alice.its.cern.ch/jira/browse/OLOG-59 what we had discussed last time:
        • ALARM / IMPORTANT should be support level in any case.
        • FATAL / ERROR could be Ops or support level. To be decided by RC. But if they are support level, the QC shifter would need to check.
        • After discussion with RC: they would prefer that checks for e.g. corrupt data be coalesced, so that a single warning to check the data quality is emitted to Ops.
    • 11:20-11:40
      Software for Hardware Accelerators (20m)
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: critical / news during the meeting: green; news from this week: blue; news from last week: purple; no news: black

      General:

      • Locally tested OpenCL compilation with Clang 14, bumping -cl-std from clc++ (OpenCL 2.0) to CLC++2021 (OpenCL 3.0) and using the clang-internal SPIR-V backend. The Arrow bump to 8.0, which was a prerequisite, is done.
        • Work on bumping GCC still ongoing (by Giulio); will follow up with Clang 15 afterwards, once we are at Arrow 10.

      ROCm compilation issues:

      • Create a new minimal reproducer for the compile error that appears when we enable the LOG(...) functionality in the HIP code, and check whether this is a bug in our code or in ROCm. Lubos will work on this.
      • Another compiler problem with template treatment found by Ruben. Have a workaround for now. Need to create a minimal reproducer and file a bug report.

      ITS GPU Tracking and Vertexing:

      • Matteo will spend 1 week working on multi-threading of ITS vertexing, then go back to GPU ITS tracking.

      TPC GPU Processing:

      • Random GPU crashes under investigation.
      • TPC CTF Skimming finalized.
      • Bug in TPC tracking on the CPU that depends on the number of threads: 5 threads give an incorrect result; 1, 2, 3, 4, 6, 64, and 128 seem to be OK. Investigating.
      • Problem in CTF decoding when input is missing (an empty input is OK, but we want a fallback treatment for the case where the input is an empty message, not only an empty header).
      • Problem in TPC tracking, when some TPC pad rows in a sector have issues, track merging across these pad rows seems not to work correctly. Investigating.

      TRD Tracking:

       

      ANS Encoding:

      • Waiting for the PR with AVX-accelerated ANS encoding.