Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC
11:00 → 11:20  PDP System Run Coordination (20m)
Speakers: David Rohr (CERN), Ole Schmidt (CERN)
Color code: critical or news during the meeting: green; news from this week: blue; news from last week: purple; no news: black
EOS handling / START-STOP-START:
- Tests with START / STOP / START planned with the FLP team today at 2pm (goal is to identify detector workflow related issues and communicate them to the detector teams)
- https://github.com/AliceO2Group/AliceO2/pull/9895 still not merged due to a conflict.
Problem at EOR and with calib workflows:
- https://alice.its.cern.ch/jira/browse/O2-3315
GPU ROCm Issues:
- Create new minimal reproducer for compile error when we enable LOG(...) functionality in the HIP code. Check whether this is a bug in our code or in ROCm. Lubos will work on this.
- Another compiler problem with template treatment found by Ruben. Have a workaround for now. Need to create a minimal reproducer and file a bug report.
Full system test:
- Multi-threaded pipeline still not working in the FST / sync processing; it works only in the standalone benchmark.
- Should be highest priority in DPL development, once slow topology generation, EoS timeout, START-STOP-START have been fixed.
- Possible workaround: Could implement a special topology with 2 devices in a chain (same number of pipeline levels, so we have basically 2 devices per GPU). The first would extract all relative offsets of TPC data from the SHM buffers and send them via an asynchronous non-DPL channel to the actual processing device. In that way, the processing could start timeframe n+1 before it has returned the result for timeframe n to DPL. When DPL starts the doProcessing function, the GPU would already be processing that TF, and would just return it when it is ready. Development time should be no more than 3 exclusive days.
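The workaround above can be sketched as a two-stage pipeline with a side channel. The snippet below is a minimal, hypothetical illustration (SideChannel and runPipelinedDemo are invented for the sketch, not O2/DPL API): the processing thread picks up the next TF as soon as its offsets arrive, independently of when the previous result is collected.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal thread-safe FIFO standing in for the asynchronous non-DPL channel.
template <typename T>
class SideChannel {
 public:
  void send(T v) {
    std::lock_guard<std::mutex> lk(m_);
    q_.push(std::move(v));
    cv_.notify_one();
  }
  T receive() {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [&] { return !q_.empty(); });
    T v = std::move(q_.front());
    q_.pop();
    return v;
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::queue<T> q_;
};

// Relative offsets of TPC data extracted from the SHM buffers (hypothetical).
struct TFOffsets {
  int tfId;
  std::vector<std::size_t> offsets;
};

// Demo: the "extractor" device pushes all TFs without waiting; the "GPU"
// device consumes them as they arrive, so TF n+1 can be in flight before
// the result of TF n has been collected. Results come back in order.
std::vector<int> runPipelinedDemo(int nTFs) {
  SideChannel<TFOffsets> offsets;
  SideChannel<int> results;
  std::thread gpuDevice([&] {
    for (int i = 0; i < nTFs; ++i) {
      results.send(offsets.receive().tfId);  // "process" the TF
    }
  });
  for (int n = 0; n < nTFs; ++n) {
    offsets.send({n, {0, 128, 256}});  // extractor: no waiting on results
  }
  std::vector<int> done;
  for (int n = 0; n < nTFs; ++n) {
    done.push_back(results.receive());  // what doProcessing would return
  }
  gpuDevice.join();
  return done;
}
```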
Global calibration status:
- Problem with TPC SCD calib performance confirmed by Ole, but not yet understood. Work in progress. The interplay between branches of the calibration chain is also under investigation.
- TPC IDC calibration:
- New O2 version deployed today, which coalesces multiple TFs into a single message, hopefully working around the performance limitation in the dpl-raw-proxy. To be tested.
- Problem with TPC SAC workflow. Not receiving data on calib node from FLPs.
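The TF-coalescing idea mentioned for the IDC calibration can be illustrated with a simple length-prefixed packing scheme. This is a hypothetical sketch (not the actual dpl-raw-proxy message format): packing k TFs into one message amortizes the per-message overhead k-fold.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Pack several TF payloads into one message, each part length-prefixed so
// the receiver can split it again. Layout is invented for illustration.
std::vector<uint8_t> coalesceTFs(const std::vector<std::vector<uint8_t>>& tfs) {
  std::vector<uint8_t> msg;
  for (const auto& tf : tfs) {
    uint32_t len = static_cast<uint32_t>(tf.size());
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&len);
    msg.insert(msg.end(), p, p + sizeof(len));    // 4-byte length prefix
    msg.insert(msg.end(), tf.begin(), tf.end());  // payload
  }
  return msg;
}

// Receiver side: recover the individual TF payloads from one message.
std::vector<std::vector<uint8_t>> splitTFs(const std::vector<uint8_t>& msg) {
  std::vector<std::vector<uint8_t>> out;
  std::size_t pos = 0;
  while (pos + sizeof(uint32_t) <= msg.size()) {
    uint32_t len;
    std::memcpy(&len, msg.data() + pos, sizeof(len));
    pos += sizeof(len);
    out.emplace_back(msg.begin() + pos, msg.begin() + pos + len);
    pos += len;
  }
  return out;
}
```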
Failures reading CCDB objects at high rate:
- Costin will follow up some proposed improvements to have a proper fix for the future. See https://alice.its.cern.ch/jira/browse/O2-3097?filter=-2
Issues currently lacking manpower, waiting for a volunteer:
- For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
- Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
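The training/test split for the parameter scan could be sketched as follows (all names hypothetical; `measure` stands in for an actual benchmark run): the best value is chosen on the training TFs only and then reported on held-out test TFs, so it is not tuned to the very frames it is judged on.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <vector>

struct ScanResult {
  int bestParam;    // value selected on the training TFs
  double testTime;  // mean time of that value on the held-out test TFs
};

// Pick the candidate minimizing mean processing time on the training TFs
// only, then evaluate it on separate test TFs.
ScanResult scanParameter(const std::vector<int>& candidates,
                         const std::vector<int>& trainTFs,
                         const std::vector<int>& testTFs,
                         const std::function<double(int, int)>& measure) {
  int best = candidates.front();
  double bestMean = std::numeric_limits<double>::max();
  for (int p : candidates) {
    double sum = 0;
    for (int tf : trainTFs) sum += measure(p, tf);
    double mean = sum / trainTFs.size();
    if (mean < bestMean) {
      bestMean = mean;
      best = p;
    }
  }
  double testSum = 0;
  for (int tf : testTFs) testSum += measure(best, tf);
  return {best, testSum / testTFs.size()};
}
```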
Speeding up startup of workflow on the EPN:
- Next step: Measure startup times in online partition controlled by AliECS, and compare to standalone measurement of 18 seconds for process startup.
EPN Scheduler:
- Started to have jobs running regularly on the nodes that are available (O(20) nodes). Problem yesterday, since some nodes were cycled in and out of the online zone, and we didn't get the exact same set of nodes back. It turned out the async ansible role was not applied to all nodes but only to those in the async partition, so some nodes had an incorrect configuration and failed the grid jobs. (Thanks to Johannes for the investigation!)
- Need a procedure / tool to move nodes quickly between online / async partition. EPN working on this. Currently most EPNs are still usually in online, and we have to ask to get some in async. Should arrive at a state where all EPNs that are not needed are in async by default.
- Discussion on open file handles / inodes and on memory usage on Monday. Minutes here: https://indico.cern.ch/event/1210850/
Important framework features to follow up:
- Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to have the pure payload: low priority.
- Backpressure reporting when there is only 1 input channel: no progress.
- Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
Minor open points:
- https://alice.its.cern.ch/jira/browse/O2-1900 : FIX in PR, but has side effects which must also be fixed.
- https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
- https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
- https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
- https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
- https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
- DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
- Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination if the wrong topology is running. Not critical, since it happens only at termination, and fixing the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
- Support in DPL GUI to send individual START and STOP commands.
Non-critical QC tasks:
- Failing QC tasks should not crash the run. On the FairMQ level, these tasks are behind a dispatcher, and get only a copy of the data via an unreliable channel, so they cannot produce backpressure. EPN and FLP are implementing a flag for non-critical tasks in ECS and DDS, so that a task failure will not stop the run.
- One potential problem: if the tasks receive CCDB objects via the DPL CCDB fetcher, this will probably use the normal forwarding of messages from one process to another, so if one QC process dies, it will stall the entire processing chain. Is that correct?
- If yes, I think it is a design problem; how can we fix this?
Online Reconstruction performance with pp:
- 3 MHz: With the current data size, without DLBZS and without clusterizer tuning, there is no way we can reasonably reconstruct this data. (It should work with extensive tuning, but in my opinion that would be a waste of time).
- 2 MHz: Processing was too slow; despite backpressure we still hit >50% CPU load on the EPNs, so essentially only the hyperthreaded cores were left, i.e. not many resources. Muon reco was disabled, and the next attempt would be to disable QC in addition. It might be possible to run part of the QC, but we could not test further due to the beam dump.
- 500 kHz with fewer EPNs: We managed to get it running on 100 EPNs, with ~45% CPU load, and ~20 GB (out of 512) of memory left. Note that the 500 kHz config has the extra reconstruction steps (MUON, DEDX) enabled. I think this is as far as we can go, and it is already on the edge, but was running for 1h without issues, so it should be stable.
- Until we have DLBZS and clusterizer tuning, I don't see the possibility to go below 100 EPNs.
- If we want to go significantly below 100 EPNs, we would need to change the SHM sizes, which would mean restarting the SHM tool with different parameters. In principle that is possible, but it would mean we have to drain the farm and restart the tool cluster-wide every time we switch between 500 kHz with few EPNs setup, and any of the other setups.
Topology generation:
- Added feature to exclude detectors from QC and Calib, as requested by Run Coordination
O2 versions at P2:
- O2 on EPNs with the CTP workflow is running now; the FLP CTP workflow will only be deployed in a few weeks.
- Status of CTP in GRP?
Other SRC items:
QC / Monitoring updates:
- TPC has opened first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side, and plan is to extend this to all detectors, and to include also trending for raw data sizes.
- QC log file writing disabled again, since multiple issues were found: https://alice.its.cern.ch/jira/browse/QC-882.
TPC CTF Skimming:
- Discussed approach with Ruben. Work ongoing.
11:20 → 11:40  Software for Hardware Accelerators (20m)
Speakers: David Rohr (CERN), Ole Schmidt (CERN)
General:
- Problem on MI100 / MI210 not solved by new ROCm beta version. AMD can reproduce the problem and is investigating.
- Locally tested OpenCL compilation with Clang 14, bumping -cl-std from clc++ (OpenCL 2.0) to CLC++2021 (OpenCL 3.0) and using the clang-internal SPIR-V backend. The Arrow bump to 8.0 is done, which was a prerequisite.
ITS GPU Tracking and Vertexing:
- Matteo will spend 1 week working on multi-threading of ITS vertexing, then go back to GPU ITS tracking.
TPC GPU Processing:
- Found bug in CPU version of GPU TPC ZS v4 (DLBZS) decoder, reported to Felix.
- Investigated "random crashes" in TPC tracking: had a look at all GPU memory fault error messages of the last 60 days. Have to distinguish 2 cases:
- Correlated crashes: gpu-reconstruction crashes on many nodes at more or less the same time. All of the cases are most certainly caused by corrupt raw data. In each case there are plenty of warnings for corrupt TPC meta data at the same time.
- Note that:
- In order to recover half-broken time frames, TFs with broken meta info are also processed by the GPU when possible.
- Not all cases of broken meta information are detected ahead of time, i.e. meta-data corruption can lead to a GPU crash without a prior warning message.
- Several of these cases were TPC standalone tests, not global runs.
- Single crashes, where just one gpu-reconstruction process segfaults (on a longer timescale of at least one hour). Much more difficult to investigate:
- Often accompanied by a warning for corrupt meta data by that process before, so same as above but only a single TF is corrupted.
- Can be caused by bad GPU memory if it happens on the same GPU many times, particularly when it happens in the FST where the TF is known to be good. One has to collect statistics on which GPUs fail often and replace them. But this has become pretty rare by now (e.g. epn199).
- Started to create some statistics; discussing with Dirk from EPN.
- There are several cases of only a single crash in a run, on a GPU which never crashed again. Not clear what is the cause.
- Unlikely that the GPU is broken.
- Likely that it is corrupt TPC metadata, since the checks that show error messages cannot detect all kinds of corruption.
- Could of course also be a race condition or another bug in the code. But we cannot reproduce the problem: we store only 1 permille of the raw data (and none for the TPC standalone test), and we don't have the raw files that crashed.
Unfortunately, cannot do anything more right now. Should fix the metadata corruption first, and replace GPUs that fail regularly. Only afterwards I would try to check for race-conditions / software bugs.
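The per-GPU crash bookkeeping described above could look like the following hypothetical helper (function name and input format invented): count gpu-reconstruction crashes per EPN node and flag nodes exceeding a threshold as candidates for a hardware check or GPU replacement.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Count crashes per node and flag nodes with more than `threshold` crashes.
std::vector<std::string> flagFrequentlyFailingNodes(
    const std::vector<std::string>& crashNodes, int threshold) {
  std::map<std::string, int> counts;
  for (const auto& node : crashNodes) {
    ++counts[node];
  }
  std::vector<std::string> flagged;
  for (const auto& [node, count] : counts) {
    if (count > threshold) {
      flagged.push_back(node);  // std::map keeps this sorted by node name
    }
  }
  return flagged;
}
```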
TRD Tracking:
ANS Encoding:
- Waiting for PR with AVX-accelerated ANS encoding
11:40 → 12:00  Monitoring Discussion (20m)
Speakers: Adam Wegrzynek (CERN), Massimo Lamanna (CERN)
- Remove _field hostname (Adam for telegraf, Massimo and Adam for dashboards CR0 and CR1 respectively)
- Reduce DPL retention to 3 days on EPN (Massimo)
- Test performance and impact of changing dataprocessor_id into just an id to distinguish stages in a pipeline (or move the id to _field?).
- Prototype round-robin of DPL buckets with run number (impact on dashboard? how to make it invisible to users?)
- Remove run number as a tag, have SOR / EOR and use time?
- Split the large bucket into partitions, e.g. 5*50 runs instead of 250 runs?
- For historical data, use <mean> to rebin the data
- Since this will not reduce the cardinality of the long-term bucket, the InfluxDB host will waste resources...
- IMO the latter is better (no query using the id indexing)
- Impact on dashboards?