Name: Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC
Start: 2023-10-25T11:00:00+02:00
End: 2023-10-25T12:20:00+02:00
Location: No location set

11:00 → 11:20

Discussion 20m

Speakers: David Rohr (CERN), Ole Schmidt (CERN)

Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

High priority framework topics:

Problem with EndOfStream / Dropping Lifetime::Timeframe data
- Together with Giulio, we debugged and fixed plenty of bugs, and apparently these 2 major topics had partially the same root causes:
  - NewRun detection in input proxy was broken.
  - 0xCCDB message was send with incorrect runNumber / startTime.
  - DPL was sending 0xCCDB message also when there was no output.
  - With output detection added, devices that do not use DPL to send messages (input proxy, raw tf reader, raw file reader) were not triggering the 0xCCDB message, since DPL was not aware of the output. Changed them to notify DPL that there was an output.
  - EndOfStream counter in input proxy was only taking into account one channel.
  - Bogus messages with runNumber 0 caused input-proxy to send EoS immediately (which we received due to problem with runNumber in 0xCCDB message).
  - EoS counter in input proxy did not work correctly if there was no data received before the EoS.
- With all these fixes, the majority of the error flood in staging and in SYNTHETIC runs on production are gone, but we are still not done:
  - The EndOfStream mechanism should be fully working now, but by design it cannot handle the case where either an EPN fails, but the FMQ channel is not yet closed so the input proxy will wait for EoS indefinitely, or when the processing after EoS exceeds the 20s grace period of DPL.
    - Both these issues can only be fixed once we add an additional state to the state machine (draft still work in progress).
  - We still see some "Dropping lifetime::timeframe" messages in SYNTHETIC / COSMIC runs. I believe these are now real errors in user code, which misses to send output in some cases. Asked Giulio to add sanity checks in DPL for such cases, otherwise it is extremely difficult to identify the ROOT cause: https://alice.its.cern.ch/jira/browse/O2-4284 https://alice.its.cern.ch/jira/browse/O2-4283
  - In async reco tests on the EPNs, I have seen such messages as well, and this led to up to 2% of TFs not reaching the AOD writer: https://alice.its.cern.ch/jira/browse/O2-4291
    - Chiara checked that this apparently doesn't happen on the GRID, must be timing related.
    - Will add check for this to the async reco validation scripts.
    - Must be investigated further.
Fix START-STOP-START for good
- https://github.com/AliceO2Group/AliceO2/pull/9895 still not merged due to a conflict.
- We need to add functionality to detect the new run, and then DPL must reset the counters for injected timeframe timeslice number, and for all oldestPossibleTimeframe counters. https://alice.its.cern.ch/jira/browse/O2-4293
Problem with QC topologies with expendable tasks- For items to do see: https://alice.its.cern.ch/jira/browse/QC-953 - Status?
Problem in QC where we collect messages in memory while the run is stopped: https://alice.its.cern.ch/jira/browse/O2-3691
- Tests ok, will be deployed after HI and then we see.
Switch 0xdeadbeef handling from on-the-fly creating dummy messages for optional messages, to injecting them at readout-proxy level.
- After Pb-Pb, need to change FLP workflows and all detector workflows on EPN.
New issue: sometimes CCDB populator produces backpressure, without processing data. Crashed several Pb-Pb runs yet: https://alice.its.cern.ch/jira/browse/O2-4244
- Disappeared after disabled CPV gain calib, that was very slow. However, this can only have hidden the problem. Apparently there is a race condition that can trigger a problem in the input handling, which makes the CCDB populator stuck. Since the run funciton of the CCDB populator is not called and it does not have a special completion policy, but simply consumeWhenAny, this is likely to be a generic problem.
- Cannot be debugged Pb-Pb right now, since it is mitigated. But must be understood afterwards.

Other framework tickets:

TOF problem with receiving condition in tof-compressor: https://alice.its.cern.ch/jira/browse/O2-3681
Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to have the pure payload: low priority.
Backpressure reporting when there is only 1 input channel: no progress: https://alice.its.cern.ch/jira/browse/O2-4237
Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
https://alice.its.cern.ch/jira/browse/O2-1900 : FIX in PR, but has side effects which must also be fixed.
https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it is only at the termination, and the fix of the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
Support in DPL GUI to send individual START and STOP commands.
Problem I mentioned last time with non-critical QC tasks and DPL CCDB fetcher is real. Will need some extra work to solve it. Otherwise non-critical QC tasks will stall the DPL chain when they fail.
DPL sending SHM metrics for all devices, not only input proxy: https://alice.its.cern.ch/jira/browse/O2-4234
Some improvements to ease debugging: https://alice.its.cern.ch/jira/browse/O2-4196 https://alice.its.cern.ch/jira/browse/O2-4195 https://alice.its.cern.ch/jira/browse/O2-4166
After Pb-Pb, we need to do a cleanup session and go through all these pending DPL tickets with a higher priority, and finally try to clean up the backlog.

Global calibration topics:

TPC IDC workflow problem.
TPC has issues with SAC workflow. Need to understand if this is the known long-standing DPL issue with "Dropping lifetime::timeframe" or something else.
Even with latest changes, difficult to ensure guaranteed calibration finalization at end of global run (as discussed with Ruben yesterday).
- After discussion with Peter, Giulio this morning: We should push for 2 additional states in the state machine at the end of run between RUNNING and READY:
  - DRAIN: For all but O2, the transision RUNNING --> FINALIZE is identical to what we do in STOP: RUNNING --> READY at the moment. I.e. no more data will come in then.
    - O2 could finalize the current TF processing with some time out, where it stops processing incoming data, and at EndOfStream trigger the calibration postprocessing.
  - FINALIZE: No more data is guaranteed to come in, but the calibration could still be running. So we leave FMQ channels open, and have a time out to finalize the calibration. If input proxies have not yet received the EndOfStream, they will inject it to trigger the final calibration.
- This would require changes in O2, DD, ECS, FMQ, but all changes except for in O2 should be trivial, since all other components would not do anything in these states.
- Started to draft a document, but want to double-check it will work out this way before making it public.
Problem with endOfStream in the middle of a run, stopping calib processing most likely fixed.
- Happened again few times and killed runs.
- (I could not reproduce the exact problem on staging, but something similar, which is now fixed.)
- There were 2 independent issues:
  - EndOfStream messages carry runNumber = 0, which triggers the newRun flag in the readout proxy, resettting the peer counters, so the check if the eos has arrived from all peers will always be true.
  - When a process crashes online, ODC takes down the remaining processes in that collection, which triggered an end of stream from the readout proxy during the STOP transition. This is no suppressed if running online and if the endOfStream counting criterion from all peers is not met.
- Above patches fully fixed it, didn't occur since then.

CCDB:

Bump to libwebsockets 4.x / JAliEn-ROOT 0.7.4: Status? Costin asked to check in async production before merging.

Sync reconstruction / Software at P2:

Now at SW version .38, have .41 available with the fixes for EoS and Deopping Lifetime::Timeframe, but would only be used if needed for data taking. Current failures are infrequent, so RC would not like to update (despite it currently floods the infologger...)
HMPID raw decoding issues still pending.
ITS TPC matching QC deployed.
Added detector scatch folder under "/scratch/services/detector_tmp/"

CTF Size:

New CTF coding scheme deployed and working.

Async reconstruction

Remaining oscilation problem: GPUs get sometimes stalled for a long time up to 2 minutes.
- Checking 2 things: does the situation get better without GPU monitoring? --> Inconclusive
- We can use increased GPU processes priority as a mitigation, but doesn't fully fix the issue.
Performance issue seen in async reco on MI100, need to investigate.
- Could not find a problem on MI100 nodes, however some nodes were slow since reading the CTFs from EOS was slow. Asked GRID and EPN experts to have a look, but no issue on processing side.
Started to tune async reco on EPNs for Pb-Pb:
- HMPID matching, TOF matching and MCH clustering are extremely slow and single-threaded. Even if not using high CPU load, this prolongs the lifetime of TFs in flight, and since we cannot increase it due to memory, other processes are stalling. This leads to problems achieving good CPU utilization.
For run 544490 (47 kHz) found a setting that yields ~80 % CPU utlization (which is close to 100% capacity considering HyperThreading), and neede ~4s per TF on an EPN (2 * 1NUMA workflow).
Still, this number is not reliable:
- Ruben found severe problems with ITS tracking and secondary vertexing, which abort early, or are much faster in high occupancy data.
- This must be fixed, and could increase the processing time significantly.

EPN major topics:

Fast movement of nodes between async / online without EPN expert intervention.
- 2 goals I would like to set for the final solution:
  - It should not be needed to stop the SLURM schedulers when moving nodes, there should be no limitation for ongoing runs at P2 and ongoing async jobs.
  - We must not lose which nodes are marked as bad while moving.
Interface to change SHM memory sizes when no run is ongoing. Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
- Lubos to provide interface to querry current EPN SHM settings - ETA July 2023, Status?
Improve DataDistribution file replay performance, currently cannot do faster than 0.8 Hz, cannot test MI100 EPN in Pb-Pb at nominal rate, and cannot test pp workflow for 100 EPNs in FST since DD injects TFs too slowly. https://alice.its.cern.ch/jira/browse/EPN-244 NO ETA
DataDistribution distributes data round-robin in absense of backpressure, but it would be better to do it based on buffer utilization, and give more data to MI100 nodes. Now, we are driving the MI50 nodes at 100% capacity with backpressure, and then only backpressured TFs go on MI100 nodes. This increases the memory pressure on the MI50 nodes, which is anyway a critical point. https://alice.its.cern.ch/jira/browse/EPN-397
New problem seen in 3 runs 544300, 544305, and 544384. There were errors before on the FLPs, which perhaps caused this behavior, but what is then seen: Many EPNs stop receiving TFs, the remaining EPNs cannot handle the rate and create backpressure. No clear error is printed to InfoLogger. Runs were eventually stopped due to the backpressure, while it was not understood why the processing created backpressure, which was simply due to too high rate to remaining EPNs.
- Not clear what the ROOT cause is, could be a SYMPTOM of a prior problem on FLPs, or a connectivity issue?
- Possible mitigation would be at least to shut down TfBuilder that do not receive data any more, such that bad nodes count against nmin.

Other EPN topics:

Check NUMA balancing after SHM allocation, sometimes nodes are unbalanced and slow: https://alice.its.cern.ch/jira/browse/EPN-245
Fix problem with SetProperties string > 1024/1536 bytes: https://alice.its.cern.ch/jira/browse/EPN-134 and https://github.com/FairRootGroup/DDS/issues/440
After software installation, check whether it succeeded on all online nodes (https://alice.its.cern.ch/jira/browse/EPN-155) and consolidate software deployment scripts in general.
Improve InfoLogger messages when environment creation fails due to too few EPNs / calib nodes available, ideally report a proper error directly in the ECS GUI: https://alice.its.cern.ch/jira/browse/EPN-65
Create user for epn2eos experts for debugging: https://alice.its.cern.ch/jira/browse/EPN-383
EPNs sometimes get in a bad state, with CPU stuck, probably due to AMD driver. To be investigated and reported to AMD.

TPC Raw decoding checks:

Add additional check on DPL level, to make sure firstOrbit received from all detectors is identical, when creating the TimeFrame first orbit.

Full system test issues:

Topology generation:

Should test to deploy topology with DPL driver, to have the remote GUI available. Status?

QC / Monitoring / InfoLogger updates:

TPC has opened first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side, and plan is to extend this to all detectors, and to include also trending for raw data sizes.

AliECS related topics:

Extra env var field still not multi-line by default.

GPU ROCm / compiler topics:

Found new HIP internal compiler error when compiling without optimization: -O0 make the compilation fail with unsupported LLVM intrinsic. Reported to AMD.
Found a new miscompilation with -ffast-math enabled in looper folllowing, for now disabled -ffast-math.
Must create new minimal reproducer for compile error when we enable LOG(...) functionality in the HIP code. Check whether this is a bug in our code or in ROCm. Lubos will work on this.
Found another compiler problem with template treatment found by Ruben. Have a workaround for now. Need to create a minimal reproducer and file a bug report.
Debugging the calibration, debug output triggered another internal compiler error in HIP compiler. No problem for now since it happened only with temporary debug code. But should still report it to AMD to fix it.
New compiler regression in ROCm 5.6, need to create testcase and send to AMD.
ROCm 5.7 released, didn't check yet. AMD MI50 will go end of maintenance in Q2 2024. Checking with AMD if the card will still be supported by future ROCm versions.

TPC GPU Processing

Bug in TPC QC with MC embedding, TPC QC does not respect sourceID of MC labels, so confuses tracks of signal and of background events.
Online runs at low IR / low energy observe weird number of clusters per track statistics.
- Problem was due to incorrect vdrift, though it is not clear why this breaks tracking so badly, being investigated.
Ruben reported an issue with global track refit, which some times does not produce the TPC track fit results. To be investigated.
Robert reported a problem in tracking of Laser runs triggering buffer overflow protection on GPU. Probably due to different local occupancy, leading to unsuited number of maximum seeds / tracks estimated. TPC took some laser raw data to check locally.
- Problem was due to bogus value in TPC transformation map.
  - Transformation map disabled for laser calib, since it anway should not be used there.
  - Sergey / Alex are checking the map, since this problem also affects the tracking of real data.
Fixed a bug where GPU Standalone Multi-Threading would segfault when shutting down the environment, since the asynchronous thread was still accessing the FMQ device before stopping.

Issues currently lacking manpower, waiting for a volunteer:

For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.

11:20 → 11:25

TRD Tracking 5m

Speaker: Ole Schmidt (CERN)

11:25 → 11:30

TPC ML Clustering 5m

Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

11:30 → 11:35

ITS Tracking 5m

Speaker: Matteo Concas (CERN)

Hybrid GPU tracking:
- Lightweight (no propagator, no material but manual correction) fitting now implemented, fitting only: sorting+filtering still to be ported.
- Tested on nvidia, it works.
- Validating results, currently checking CPU to work with no Material budget (and no propagator) for more fair comparison is done properly.
- Number of tracks similar (pp data 1 TF: e.g. 108592 vs 108400), pT as well (?), chi2 is wrong. To take with a pinch of salt (red: gpu, blue: cpu).

Speedup? No tuning yet, but e.g. 20 blocks, 512 threads (~5):
- [1236858:its-tracker]: [10:55:00][INFO] - Hybrid Track finding completed in: 246.244 ms (GPU: no material + correction)
- [1237359:its-tracker]: [10:55:55][INFO] - Track finding completed in: 1272 ms (CPU:no material + manual correction)
TODO: discuss PR, check chi2, optimise code (current is 1:1 CPU/GPU, some "if"s can be traded), ...

Choose timezone

Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC