Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Timezone: Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 11:00 → 11:20
      PDP System Run Coordination 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      Problem with EoS and START/STOP/START:

      • Tests with START / STOP / START planned with the FLP team today at 2pm (goal is to identify detector workflow related issues and communicate them to the detector teams)
      • https://alice.its.cern.ch/jira/browse/O2-3315
        • All runs fail with errors at the end, sometimes losing calibration data.
        • Partially fixed this problem; 2 issues found so far:
          • The calib workflow scripts were lacking the setting of the completion policy of the output-proxy. (Not sure what the default policy is, but sporadic and TF proxies certainly need different policies, so at least one of them was configured incorrectly.)
          • Input for sporadic proxies is not guaranteed to follow the oldestPossibleTimeslice paradigm, since it is just sent sporadically. Added an option to not drop data that is older than the assumed oldestPossibleTimeslice. See the sketch after this list.
      • For now we'll try to revert the work done on START/STOP/START, to check whether it fixes the regression we have had for a few weeks already. Giulio is to state which commits to revert. Ole will build an alternative version on the EPNs, so we can test there (we cannot reproduce the issue locally). If this fixes it, we revert in O2, and we merge the changes back into O2, together with the fix for the regression, only after validating with an alternative version on the EPNs that there is no problem at end of run.
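
      A minimal sketch of the first fix, assuming DPL's standard in-code workflow customization hook; the device-name matcher and the chosen policy are illustrative, and the actual calib workflow scripts may configure this differently:

        // Sketch: give the sporadic output proxies an explicit completion
        // policy that consumes whatever input arrives, instead of waiting
        // for a complete input set. The matcher string is hypothetical.
        #include <vector>
        #include "Framework/CompletionPolicy.h"
        #include "Framework/CompletionPolicyHelpers.h"

        using namespace o2::framework;

        void customize(std::vector<CompletionPolicy>& policies)
        {
          policies.push_back(CompletionPolicyHelpers::defineByName(
            "output-proxy-sporadic.*", CompletionPolicy::CompletionOp::Consume));
        }

        #include "Framework/runDataProcessing.h"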

      Full system test:

      • Multi-threaded pipeline still works only in the standalone benchmark, not yet in the FST / sync processing.
        • Should be the highest priority in DPL development once the slow topology generation, the EoS timeout, and START-STOP-START have been fixed.
        • Possible workaround: implement a special topology with 2 devices in a chain (same number of pipeline levels, so we basically have 2 devices per GPU). The first would extract all relative offsets of the TPC data from the SHM buffers and send them via an asynchronous non-DPL channel to the actual processing device. That way, processing of time frame n+1 could start before the result for time frame n is returned to DPL: when DPL invokes the processing function, the GPU would already be working on that TF, and would just return the result when ready (see the sketch below). Development time should be no more than 3 exclusive days.
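
      A rough, self-contained illustration of the overlap idea only (all names are hypothetical; this is not the O2/DPL mechanism itself): a front stage enqueues each TF as soon as its offsets are known, and a worker standing in for the GPU device can start TF n+1 while TF n's result is still being collected:

        #include <condition_variable>
        #include <cstdio>
        #include <future>
        #include <mutex>
        #include <queue>
        #include <thread>

        struct TimeFrame { int id; };
        struct Result { int id; };

        class PipelinedProcessor {
         public:
          PipelinedProcessor() : worker_([this] { run(); }) {}
          ~PipelinedProcessor() {
            { std::lock_guard<std::mutex> l(m_); done_ = true; }
            cv_.notify_all();
            worker_.join();
          }
          // Stage 1 ("offset extractor"): enqueue a TF immediately, without
          // waiting for earlier TFs to be returned to the framework.
          std::future<Result> submit(TimeFrame tf) {
            std::promise<Result> p;
            auto fut = p.get_future();
            { std::lock_guard<std::mutex> l(m_); queue_.push({tf, std::move(p)}); }
            cv_.notify_one();
            return fut;
          }
         private:
          struct Job { TimeFrame tf; std::promise<Result> result; };
          // Stage 2 (stand-in for the GPU device): while the framework still
          // holds the future of TF n, TF n+1 can already be processed here.
          void run() {
            for (;;) {
              std::unique_lock<std::mutex> l(m_);
              cv_.wait(l, [this] { return done_ || !queue_.empty(); });
              if (queue_.empty()) { return; }
              Job job = std::move(queue_.front());
              queue_.pop();
              l.unlock();
              // ... launch the actual (GPU) processing of job.tf here ...
              job.result.set_value(Result{job.tf.id});
            }
          }
          std::mutex m_;
          std::condition_variable cv_;
          std::queue<Job> queue_;
          bool done_ = false;
          std::thread worker_;
        };

        int main() {
          PipelinedProcessor proc;
          auto r0 = proc.submit(TimeFrame{0}); // TF 0 starts processing
          auto r1 = proc.submit(TimeFrame{1}); // TF 1 queued before TF 0 is back
          std::printf("TF %d done\n", r0.get().id);
          std::printf("TF %d done\n", r1.get().id);
        }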

      Global calibration status:

      • Problem with TPC SCD calib performance confirmed by Ole.
        • Much faster with zstd; will switch to zstd. Status?
      • TPC IDC / SAC calibration:
        • Cleanup for IDC done, should be fully deployed and stable now.
        • Problem with SAC was due to a typo in the subspec; fixed AFAIK.
        • After the fix, SAC via the calib node is working. For some reason SAC still fails when both SAC and IDC are active over the calib node together. Under investigation.
      • PHOS and ZDC calibration write ROOT files to the local folder; this should be disabled. Asked Chiara to take care of it.

      Failures reading CCDB objects at high rate:

      Issues currently lacking manpower, waiting for a volunteer:

      • For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
      • Redo / improve the parameter range scan for tuning GPU parameters. In particular on the AMD GPUs, which seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets (see the skeleton after this list).
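
      A skeleton of what the redone scan could look like; runBenchmark() is a hypothetical stand-in for the actual standalone benchmark, and the parameter names and value grids are placeholders:

        #include <cstdio>
        #include <limits>
        #include <string>

        struct Params { int threadsPerBlock; int blocks; };

        // Placeholder model so the sketch is self-contained; in reality this
        // would invoke the standalone benchmark on a TF data set and return
        // the measured processing time per TF.
        double runBenchmark(const Params& p, const std::string& /*dataSet*/) {
          return 3.5 - 0.001 * p.threadsPerBlock + 0.0005 * p.blocks;
        }

        int main() {
          // TFs must have realistic (production-size) memory footprint,
          // since the AMD GPUs are sensitive to memory sizes.
          const std::string trainSet = "train_tfs"; // hypothetical data sets
          const std::string testSet = "test_tfs";

          Params best{};
          double bestTrain = std::numeric_limits<double>::max();
          for (int tpb : {64, 128, 256, 512}) {
            for (int blocks : {60, 120, 240}) {
              Params p{tpb, blocks};
              double t = runBenchmark(p, trainSet); // tune on training set only
              if (t < bestTrain) { bestTrain = t; best = p; }
            }
          }
          // Validate the chosen parameters on the independent test set,
          // so the tuning is not overfitted to a single data set.
          std::printf("best: tpb=%d blocks=%d train=%.3f test=%.3f s/TF\n",
                      best.threadsPerBlock, best.blocks, bestTrain,
                      runBenchmark(best, testSet));
        }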

      Speeding up startup of workflow on the EPN:

      • Next step: measure startup times in the online partition controlled by AliECS, and compare to the standalone measurement of 18 seconds for process startup.
      • RC will take large SYNTHETIC runs with all detectors and as many EPNs as possible this week, and repeat the startup of the same partition a few times, so we can investigate the state-transition times from the InfoLogger with some statistics.

      EPN Scheduler

      • Problem with no jobs running due to JDL matching problem after resubmission was a vobox bug, fixed by Max.
      • Need a procedure / tool to move nodes quickly between online / async partition. EPN working on this. Currently most EPNs are still usually in online, and we have to ask to get some in async. Should arrive at a state where all EPNs that are not needed are in async by default.

      Important framework features to follow up:

      • Grafana metrics: might want to introduce additional rate metrics that subtract the header overhead, to show the pure payload rate (see the small sketch after this list): low priority.
      • Backpressure reporting when there is only 1 input channel: no progress.
      • Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
      • Performance problem in the raw proxy solved together with Alexey. Now using 2 I/O threads, at the expense of additional CPU load. Still not clear why a single thread cannot do more than 1.7 GB/s on the EPN; should be investigated on the FMQ side.
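
      The payload-rate metric itself is simple arithmetic; a tiny sketch (the per-message header size used here is an assumption, not the actual O2 header size):

        #include <cstdio>

        int main() {
          const double headerBytes = 80.0; // assumed fixed per-message overhead
          double totalBytes = 2.0e9;       // bytes transferred in the interval
          double messages = 1.0e6;         // messages in the interval
          double seconds = 10.0;           // length of the interval
          double payloadRate = (totalBytes - messages * headerBytes) / seconds;
          std::printf("pure payload rate: %.3f GB/s\n", payloadRate / 1e9);
        }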

      Minor open points:

      • https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but it has side effects which must also be fixed.
      • https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
      • https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
      • https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
      • https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
      • https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
      • DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
      • Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it happens only at termination, and fixing the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
      • Support in DPL GUI to send individual START and STOP commands.
      • Reassigned the JIRA tickets that were left with Matthias to Giulio (sorry :))

      Non-critical QC tasks:

      • The problem I mentioned last time with non-critical QC tasks and the DPL CCDB fetcher is real, and will need some extra work to solve; otherwise non-critical QC tasks will stall the DPL chain when they fail.

      Reconstruction performance:

      • Async performance:
        • Fixes for TPC QC and AOD working, testing the 1 NUMA config.
        • Unfortunately spotted 2 new problems:
          • Async workflow gets stuck when I use the tuned multiplicities. Giulio and I are checking: https://alice.its.cern.ch/jira/browse/O2-3399
          • I can work around that by disabling AOD writing (not clear why), but then we get oscillations with the TF rate limiting, and removing bottlenecks only amplifies the oscillations. Still checking, but I do not manage to reach 100% CPU load.
        • Performance status so far: 3.45 s per TF in 4GPU setup, compared to 3.55s in 4 * 1GPU setup, on LHC22f data.
          • Average EPN CPU load is at ~40%; the workflow could use 50%, so there is a 10/40 = 25% margin for improvement.
          • AOD production, which was disabled, would use ~3%, bringing the load to 43%, so the net margin is (50 - 43)/43 = 7/43 ≈ 16%, i.e. the speed of light is 3.45 s / 1.16 ≈ 2.97 s per TF.
        • Doing Pb-Pb benchmarks now (using plain dpl-workflow.sh without extra env variables, since we use MC data)
      • Opened a JIRA ticket for EPN to follow up on the interface to change SHM memory sizes when no run is ongoing (requested 1 year ago). Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250

      Topology generation:

      • Should change the dpl-workflow script to fail if any process in the DPL pipe (workflow | workflow | ...) has a non-zero exit code, e.g. via bash's set -o pipefail or by checking PIPESTATUS.

      TPC DLBZS / SYNTHETIC runs

      • TPC has requested a small format change, which is not backward compatible, to make it easier to get timing in the FPGA.
      • Will implement the software side, but we have to replace all DLBZS data sets, particularly for SYNTHETIC runs.

      O2 versions at P2:

      • Software update yesterday. Unfortunately many problems in many places. One was caused by us, due to interference between the TRG --> CTP translation in the topology script and changes for the new ECS workflow panel. Fixed by Ole.
      • Deployed the ECS workflow panel update. We can now disable QC / CALIB for individual detectors, instead of providing a comma-separated list of detectors to be enabled.

      Other SRC items:

       

      QC / Monitoring / InfoLogger updates:

      • TPC has opened a first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side, and the plan is to extend this to all detectors and to also include trending for raw data sizes.
      • No issues with QC log file writing so far, seems we have fixed all problems.
      • Commented on https://alice.its.cern.ch/jira/browse/OLOG-59 what we had discussed last time:
        • ALARM / IMPORTANT should be support level in any case.
        • FATAL / ERROR could be Ops or support level. To be decided by RC. But if they are support level, the QC shifter would need to check.
        • After discussion with RC, they would prefer if checks for e.g. corrupt data could be coalesced, with a single warning emitted to Ops to check the data quality.

      Testing Pb-Pb workflow at P2:

      • Creating a SYNTHETIC data set with low-IR Pb-Pb data, to test the Pb-Pb workflow we want to run (which will exceptionally include MUON and DEDX, and full ITS processing). To be deployed today.
    • 11:20 → 11:40
      Software for Hardware Accelerators 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      General:

      • Locally tested OpenCL compilation with Clang 14, bumping -cl-std from clc++ (OpenCL 2.0) to CLC++2021 (OpenCL 3.0) and using the clang-internal SPIR-V backend. The Arrow bump to 8.0, which was a prerequisite, is done.
        • Giulio has opened a PR for GCC 12.2 with fixes for GPU in our local branch. Afterwards we can bump to arrow 10 and then LLVM 15 (which won't require any additional fixes for GPU).

      ROCm compilation issues:

      • Create a new minimal reproducer for the compile error when we enable the LOG(...) functionality in the HIP code, to check whether this is a bug in our code or in ROCm. Lubos will work on this.
        • Matteo implemented a workaround for the LOG(...) problem, so we can at least use the LOG macro in the ROCm code now (see the sketch after this list). But the internal compiler error itself is not yet fixed, so it may come back.
      • Another compiler problem with template handling was found by Ruben. We have a workaround for now, but need to create a minimal reproducer and file a bug report.
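
      As far as the LOG workaround goes, it presumably follows the usual guard-macro pattern; a self-contained sketch with hypothetical macro names (this is not the actual O2 implementation):

        #include <cstdio>

        // GPULOG is a hypothetical stand-in for the real logging macro.
        #if defined(__HIP_DEVICE_COMPILE__)
        // Device compilation pass: avoid the host stream-logging templates
        // that trigger the ROCm internal compiler error; fall back to the
        // device-side printf (or compile the call out entirely).
        #define GPULOG(fmt, ...) printf(fmt "\n", ##__VA_ARGS__)
        #else
        // Host pass: free to use the full logging facility (plain printf here
        // to keep the sketch self-contained; O2 uses fairlogger's LOG macros).
        #define GPULOG(fmt, ...) std::printf(fmt "\n", ##__VA_ARGS__)
        #endif

        int main() {
          GPULOG("ROCm LOG workaround sketch: value = %d", 42);
        }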

      ITS GPU Tracking and Vertexing:

      • Matteo will spend 1 week working on multi-threading of ITS vertexing, then go back to GPU ITS tracking.

      TPC GPU Processing

      • Felix fixed a problem in the clusterization which gave different results between the CPU and GPU versions.
      • Random GPU crashes under investigation.
      • TPC CTF Skimming:
        • Finalized the eta / z check for unattached clusters (see the sketch after this list).
        • Bz / TF-length metainfo is stored with the CTF; switched to a tree dictionary (thanks to Ruben) to have schema evolution, and fixed schema evolution in the CTF headers (together with Ruben).
        • Still possible improvements:
          • DCA cut for clusters attached to tracks.
          • Need to take into account TPC distortions for the z / eta check, either with some margin or assuming some average distortion corrections.
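
      For illustration, the geometry of such an eta / z acceptance check could look as follows; the cut values and the distortion margin are placeholders, not the actual skimming parameters:

        #include <cmath>
        #include <cstdio>

        constexpr float kMaxEta = 1.5f;           // placeholder eta cut
        constexpr float kDriftLength = 250.f;     // approx. TPC drift length [cm]
        constexpr float kDistortionMargin = 10.f; // placeholder margin for space-charge distortions

        // Keep an unattached cluster only if it is compatible with the acceptance.
        bool acceptUnattachedCluster(float x, float y, float z)
        {
          if (std::fabs(z) > kDriftLength + kDistortionMargin) {
            return false; // outside the drift volume even allowing for distortions
          }
          float r = std::hypot(x, y);
          float theta = std::atan2(r, std::fabs(z)); // polar angle w.r.t. beam axis
          float eta = -std::log(std::tan(theta / 2.f));
          return eta < kMaxEta; // acts as an |eta| cut, since |z| makes eta non-negative
        }

        int main()
        {
          std::printf("accepted: %d\n", acceptUnattachedCluster(100.f, 20.f, 80.f));
        }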

      TRD Tracking

       

      ANS Encoding

      • Waiting for the PR with AVX-accelerated ANS encoding.