Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Europe/Zurich
Videoconference
ALICE GPU Meeting
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 11:00 - 11:20: Discussion (20m)
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: critical / news during the meeting: green; news from this week: blue; news from last week: purple; no news: black.


      High priority framework topics:

      • Regression of the START-STOP-START work that makes all runs fail with lots of error messages and breaks many calibration runs: https://alice.its.cern.ch/jira/browse/O2-3315
        • The revert didn't help for some reason and needs further investigation; we should fix the regression ASAP, and the general START-STOP-START issue before data taking restarts.
      • Async workflow for the 1-NUMA-domain configuration with higher multiplicities gets stuck: https://alice.its.cern.ch/jira/browse/O2-3399
      • Fix START-STOP-START for good
      • Multi-threaded pipeline works only in the standalone benchmark, still not in the FST / sync processing.
      • Fix DebugGUI to show >64k vertices also in local setup / bump imgui.
      • Support marking QC tasks as non-critical in the DDS and O2Control topology export: https://alice.its.cern.ch/jira/browse/O2-3398

      Other framework tickets:

      • Grafana metrics: might want to introduce additional rate metrics that subtract the header overhead, to show the pure payload rate. Low priority.
      • Backpressure reporting when there is only 1 input channel: no progress.
      • Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
      • https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but it has side effects which must also be fixed.
      • https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
      • https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
      • https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
      • https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
      • https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
      • DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
      • Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it happens only at termination and fixing the topology avoids it in any case, but we should still understand and fix the crash itself. A reproducer is available.
      • Support in DPL GUI to send individual START and STOP commands.
      • The problem I mentioned last time with non-critical QC tasks and the DPL CCDB fetcher is real and will need some extra work to solve; otherwise non-critical QC tasks will stall the DPL chain when they fail.

      Global calibration topics:

      • TPC IDC / SAC calibration:
        • More fixes applied, but the combination of the SAC and IDC workflows is still not working; under investigation.

      Async reconstruction

      • Severe memory issues over the Christmas break: most jobs on the EPNs crashed due to going OOM.
      • Compared to the run numbers / O2 versions used for GPU tuning / release validation, other run numbers / the new O2 version need somewhat more memory. This is only a few GB, but since we have literally no margin, jobs are killed for going OOM.
        • This affected the EPNs more than the GRID, since the EPNs used a larger SHM size to hold more time frames (due to faster GPU processing). With a reduced SHM size it seems to work on the EPNs, but we have to reduce the number of TFs in flight to the level of the CPU jobs.
      • Some investigation of the memory usage:
        • ~100 O2 + QC processes running:
          • Minimum per-process memory usage is 134 MB.
          • Median is ~250 MB.
          • Processes with large memory usage are: its-tracking, tpc-tracking, its-tpc matching, tpc entropy decoding, ctf reader, qc file sink, tof and some other QCs.
        • But nothing too excessive: TPC + ITS tracking are at ~3 GB, the rest is 1 GB or below.
        • The sum of the process memory is ~38 GB and the SHM size was 19 GB, against a cgroup memory limit of 60 GB (38 + 19 = 57 GB of 60 GB), so basically no margin.
      • We should try to reduce the memory of some processes, particularly QC; we have to do that in any case, but it will not help much.
      • The only way out is to switch to the 1-NUMA-domain workflow, for which we first have to fix the problem that the workflow gets stuck (see above).

      EPN major topics:

      • New AMD ROCm >= 5.4 no longer supports CentOS as operating system; officially supported are now only RHEL, SLES, and Ubuntu. Checking with AMD whether Alma or Rocky Linux would work. We should switch the EPN farm to a new OS before data taking, otherwise we will not be able to deploy new fixes from AMD.
      • Update ROCm to 5.3 for now.
      • Need a procedure / tool to move nodes quickly between the online and async partitions. EPN is working on this. Currently most EPNs are usually in online, and we have to ask to get some in async. We should arrive at a state where all EPNs that are not needed online are in async by default.
      • Opened a JIRA ticket for EPN to follow up on the interface to change SHM memory sizes when no run is ongoing (requested 1 year ago). Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250

      Other EPN topics:

      Speeding up startup of workflow on the EPN:

      • Measured startup time of processes in a SYNTHETIC Pb-Pb run with all detectors and the maximum number of EPNs:
        • 186 seconds on the FLPs (serialized with the EPN startup, which happens only after the EPNs are done).
        • 99 seconds on the EPNs.
        • 12 seconds (+/- 1) for the PDP processes on the EPN if started by the DPL driver.

      Topology generation:

      • Should change the dpl-workflow script to fail if any process in the DPL pipe (workflow | workflow | ...) exits with a non-zero code; see the sketch after this list.
      • Switching phase 1 of the topology generation to updateable RPMs instead of a script in the home folder (basically just copying the existing script to another place). Timo will set up the Jenkins builder.
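
      A minimal bash sketch of such a pipe-failure check (the workflow names below are placeholders, not the actual O2 commands; the real dpl-workflow script may of course implement the check differently):

          set -o pipefail                        # pipeline exit code = last non-zero stage
          workflow_a | workflow_b | workflow_c   # placeholder commands
          echo "pipeline exit code: $?"

          # Alternative without pipefail: inspect every stage via PIPESTATUS.
          workflow_a | workflow_b | workflow_c
          for rc in "${PIPESTATUS[@]}"; do
              if [[ $rc -ne 0 ]]; then
                  echo "pipe stage failed with code $rc" >&2
                  exit "$rc"
              fi
          done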

      QC / Monitoring / InfoLogger updates:

      • TPC has opened a first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side, and the plan is to extend this to all detectors and to also include trending for raw data sizes.

      CCDB topics:

      GPU ROCm / compiler topics:

      • Locally tested OpenCL compilation with Clang 14, bumping -cl-std from clc++ (OpenCL 2.0) to clc++2021 (OpenCL 3.0) and using the clang-internal SPIR-V backend; see the sketch after this list. The Arrow bump to 8.0 is done, which was a prerequisite.
        • Work on bumping GCC is still ongoing (by Giulio); will follow up with Clang 15 afterwards, once we are at Arrow 10.
      • Found a new HIP internal compiler error when compiling without optimization: -O0 makes the compilation fail with an unsupported LLVM intrinsic. Reported to AMD.
      • Must create a new minimal reproducer for the compile error that appears when we enable the LOG(...) functionality in the HIP code, to check whether this is a bug in our code or in ROCm. Lubos will work on this.
      • Another compiler problem with template treatment was found by Ruben. We have a workaround for now, but need to create a minimal reproducer and file a bug report.
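
      For reference, a minimal sketch of the bumped compile invocation (kernel.cl / kernel.spv are placeholder file names, the exact flags used in O2 may differ, and with Clang 14 the spirv64 target may still invoke the external llvm-spirv translator rather than a fully internal backend):

          # new: C++ for OpenCL 2021, based on OpenCL 3.0
          clang --target=spirv64 -cl-std=clc++2021 -c kernel.cl -o kernel.spv
          # old: C++ for OpenCL 1.0, based on OpenCL 2.0
          clang --target=spirv64 -cl-std=clc++ -c kernel.cl -o kernel.spv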

      TPC DLBZS / SYNTHETIC runs

      • The change to the DLBZS format requested by TPC is implemented in the GPU decoder and in the simulation; updated the SYNTHETIC data sets at P2.
      • Checked some DLBZS data recorded by TPC with the encoder. Found some issues (incorrect link ID), reported to TPC, fixed in new firmware. Waiting for new data to check.

      TPC GPU Processing

      • Random GPU crashes under investigation.
      • TPC CTF Skimming finalized.
      • Bug in TPC tracking on the CPU depending on the number of threads: 5 threads give an incorrect result; 1, 2, 3, 4, 6, 64, and 128 seem to be OK. Investigating.
      • TPC CTF decoding now accepts "no input", not only "empty input"
      • Problem in TPC tracking: when some TPC pad rows in a sector have issues, track merging across these pad rows seems not to work correctly. Investigating.
      • Fixed several issues after a report by Ruben about a problem in the refit:
        • Storage of outer parameters for looping tracks (now stored at the outermost position of the primary leg): should we do this only when secondary legs are dropped, or always?
        • Fixed the removal of the cluster association of dropped secondary legs, if the cluster is shared and attached to a primary leg of a different track.
        • Fixed the leg counting for CE-crossing tracks.
        • There still seems to be a problem in the refit with the TrackParCov track model (not seen with the GPU track model): some low-pT tracks get bogus parameters after the 3rd or 4th hit. Ruben is checking.

      TRD Tracking

      • Some minor updates: fixed issues where too accurate timing in the pile-up scenario led to fake vertices, and added a parameter to remove TRD tracks with fewer than 3 matches.

      ITS GPU Tracking and Vertexing:

      • Work on tracking is ongoing; splitting of the TF to reduce the memory size has been implemented.

      ANS Encoding

      • Waiting for PR with AVX-accelerated ANS encoding

      Issues currently lacking manpower, waiting for a volunteer:

      • For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
      • Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.